Author: Indranil Ghosh

Title: An introduction to hands-on football data analysis in Python

Institute: School of Fundamental Sciences, Massey University

Twitter: @indraghosh314

Website: https://indrag49.github.io/

Date: 03-10-2021



Abstract

This talk introduces the following basic concepts for anyone who wants to start working on football data analysis:

  • How to get open-access event data from StatsBomb using statsbombpy,

  • How to draw a soccer pitch using mplsoccer,

  • How to visualize a pass network for a particular team in a particular match,

  • How to use the NetworkX package to analyze the pass network,

  • How to apply computational geometry concepts such as convex hulls, Voronoi diagrams, and Delaunay triangulations to football event and tracking data using the Python package scipy.spatial.

Start with statsbombpy

  • Install statsbombpy with pip:
In [ ]:
pip install statsbombpy

The open data from StatsBomb can be accessed without authentication, but it is advisable to read the Terms & Conditions section of their documentation page first.

  • Now we will go step by step to understand how to extract the relevant data. Before that, we need to import the statsbombpy package.
In [2]:
from statsbombpy import sb
  • We then import the numpy and the pandas packages that help us manipulate our datasets and perform analyses like data cleaning and data extraction.
In [3]:
import numpy as np
import pandas as pd
  • To get access to the competitions dataset, type the following:
In [4]:
comp = sb.competitions()
credentials were not supplied. open data access only
  • The dataset comp looks like this:
In [5]:
comp.head(15)
Out[5]:
competition_id season_id country_name competition_name competition_gender season_name match_updated match_available
0 16 4 Europe Champions League male 2018/2019 2021-05-19T08:38:06.515138 2021-05-19T08:38:06.515138
1 16 1 Europe Champions League male 2017/2018 2021-01-23T21:55:30.425330 2021-01-23T21:55:30.425330
2 16 2 Europe Champions League male 2016/2017 2020-08-26T12:33:15.869622 2020-07-29T05:00
3 16 27 Europe Champions League male 2015/2016 2020-08-26T12:33:15.869622 2020-07-29T05:00
4 16 26 Europe Champions League male 2014/2015 2020-08-26T12:33:15.869622 2020-07-29T05:00
5 16 25 Europe Champions League male 2013/2014 2020-08-26T12:33:15.869622 2020-07-29T05:00
6 16 24 Europe Champions League male 2012/2013 2020-08-26T12:33:15.869622 2020-07-29T05:00
7 16 23 Europe Champions League male 2011/2012 2020-08-26T12:33:15.869622 2020-07-29T05:00
8 16 22 Europe Champions League male 2010/2011 2020-07-29T05:00 2020-07-29T05:00
9 16 21 Europe Champions League male 2009/2010 2020-07-29T05:00 2020-07-29T05:00
10 16 41 Europe Champions League male 2008/2009 2020-08-30T10:18:39.435424 2020-08-30T10:18:39.435424
11 16 39 Europe Champions League male 2006/2007 2021-03-31T04:18:30.437060 2021-03-31T04:18:30.437060
12 16 37 Europe Champions League male 2004/2005 2021-04-01T06:18:57.459032 2021-04-01T06:18:57.459032
13 16 44 Europe Champions League male 2003/2004 2021-04-01T00:34:59.472485 2021-04-01T00:34:59.472485
14 16 76 Europe Champions League male 1999/2000 2020-07-29T05:00 2020-07-29T05:00
  • We can extract the column names of comp to understand the dataset better and draw out relevant information from the same. Type the following:
In [6]:
print(comp.columns)
Index(['competition_id', 'season_id', 'country_name', 'competition_name',
       'competition_gender', 'season_name', 'match_updated',
       'match_available'],
      dtype='object')
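  • Once the column names are known, the row for a given competition and season can also be located with a boolean filter instead of by eye. The sketch below is only an illustration: comp_sample is a hypothetical hand-built frame that copies three rows from the table above, and with the real data comp would take its place.

```python
import pandas as pd

# Hypothetical stand-in for `comp`: three rows copied from the table above.
comp_sample = pd.DataFrame({
    "competition_id": [16, 16, 16],
    "season_id": [4, 1, 2],
    "competition_name": ["Champions League"] * 3,
    "season_name": ["2018/2019", "2017/2018", "2016/2017"],
})

# Keep only the row for the competition and season of interest.
mask = (
    (comp_sample["competition_name"] == "Champions League")
    & (comp_sample["season_name"] == "2017/2018")
)
row = comp_sample.loc[mask].iloc[0]

# These two ids are what a matches lookup needs.
competition_id = int(row["competition_id"])
season_id = int(row["season_id"])
print(competition_id, season_id)
```

Filtering by name rather than by position keeps the lookup valid if the row order of the competitions table changes.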
  • Let us make sense of a particular row from the comp dataset. For example, in the row where competition_id is 16 and season_id is 1, the country_name is Europe, the competition_name is Champions League, the season_name is 2017/2018, and so on. Suppose we want to analyze a game from the 2017/18 Champions League season. We note the competition_id and season_id in that row, which are 16 and 1 respectively. Now we extract the matches dataset by typing the following:
In [7]:
mat = sb.matches(competition_id = 16, season_id = 1)
credentials were not supplied. open data access only
  • The dataset mat looks like this:
In [8]:
mat
Out[8]:
match_id match_date kick_off competition season home_team away_team home_score away_score match_status match_status_360 last_updated last_updated_360 match_week competition_stage stadium referee data_version shot_fidelity_version xy_fidelity_version
0 18245 2018-05-26 20:45:00.000 Europe - Champions League 2017/2018 Real Madrid Liverpool 3 1 available unscheduled 2021-01-23T21:55:30.425330 None 7 Final NSK Olimpijs'kyj M. Mažić 1.1.0 2 2
  • Evidently, the mat dataset gives us the match ids, the match dates, the kick-off times, the home and away teams, the scores, the name of the referee who officiated the match, and so on. Here match_id is the unique id that lets us draw out event data for a particular match from the 2017/18 Champions League season. We see there is only one match available, with match_id = 18245: the Champions League final between Real Madrid and Liverpool ⚽, which took place at the NSC Olimpiyskiy stadium in Kyiv and ended 3-1 in Real Madrid's favor 👀. A great feat to be honest! Let us obtain the event data for this match.
In [9]:
events = sb.events(match_id = 18245)
credentials were not supplied. open data access only
  • The events dataset, which holds the event data for this match, looks like this:
In [10]:
events
Out[10]:
50_50 ball_receipt_outcome ball_recovery_recovery_failure block_offensive carry_end_location clearance_aerial_won clearance_body_part clearance_head clearance_left_foot clearance_right_foot ... shot_statsbomb_xg shot_technique shot_type substitution_outcome substitution_replacement tactics team timestamp type under_pressure
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN {'formation': 41212, 'lineup': [{'player': {'i... Real Madrid 00:00:00.000 Starting XI NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN {'formation': 433, 'lineup': [{'player': {'id'... Liverpool 00:00:00.000 Starting XI NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN Real Madrid 00:00:00.000 Half Start NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN Liverpool 00:00:00.000 Half Start NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN Liverpool 00:00:00.000 Half Start NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3492 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN Real Madrid 00:42:21.211 Offside NaN
3493 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN Real Madrid 00:48:31.725 Half End NaN
3494 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN Liverpool 00:48:31.725 Half End NaN
3495 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN Liverpool 00:48:02.893 Half End NaN
3496 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN Real Madrid 00:48:02.893 Half End NaN

3497 rows × 86 columns

  • We now have access to all the events from the Real Madrid vs. Liverpool match. We can print the column names to get a clearer overview of what kinds of events to expect from the match.
In [11]:
print(events.columns)
Index(['50_50', 'ball_receipt_outcome', 'ball_recovery_recovery_failure',
       'block_offensive', 'carry_end_location', 'clearance_aerial_won',
       'clearance_body_part', 'clearance_head', 'clearance_left_foot',
       'clearance_right_foot', 'counterpress', 'dribble_nutmeg',
       'dribble_outcome', 'dribble_overrun', 'duel_outcome', 'duel_type',
       'duration', 'foul_committed_advantage', 'foul_committed_card',
       'foul_committed_type', 'foul_won_advantage', 'foul_won_defensive',
       'goalkeeper_body_part', 'goalkeeper_end_location', 'goalkeeper_outcome',
       'goalkeeper_position', 'goalkeeper_punched_out', 'goalkeeper_technique',
       'goalkeeper_type', 'id', 'index', 'injury_stoppage_in_chain',
       'interception_outcome', 'location', 'match_id', 'minute', 'off_camera',
       'out', 'pass_aerial_won', 'pass_angle', 'pass_assisted_shot_id',
       'pass_body_part', 'pass_cross', 'pass_cut_back', 'pass_end_location',
       'pass_goal_assist', 'pass_height', 'pass_inswinging', 'pass_length',
       'pass_miscommunication', 'pass_outcome', 'pass_outswinging',
       'pass_recipient', 'pass_shot_assist', 'pass_straight', 'pass_switch',
       'pass_technique', 'pass_through_ball', 'pass_type', 'period',
       'play_pattern', 'player', 'position', 'possession', 'possession_team',
       'related_events', 'second', 'shot_aerial_won', 'shot_body_part',
       'shot_end_location', 'shot_first_time', 'shot_freeze_frame',
       'shot_key_pass_id', 'shot_one_on_one', 'shot_outcome', 'shot_redirect',
       'shot_statsbomb_xg', 'shot_technique', 'shot_type',
       'substitution_outcome', 'substitution_replacement', 'tactics', 'team',
       'timestamp', 'type', 'under_pressure'],
      dtype='object')
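  • With this many columns, a quick tally of the type column is one way to see which kinds of events a match contains. The following is a minimal sketch on a hypothetical frame, events_sample, whose team and type values mirror the first ten event rows of this match; on the real data the same two calls would be applied to events.

```python
import pandas as pd

# Hypothetical stand-in for `events`, reduced to the `team` and `type` columns.
events_sample = pd.DataFrame({
    "team": ["Real Madrid", "Liverpool", "Real Madrid", "Liverpool", "Liverpool",
             "Real Madrid", "Liverpool", "Liverpool", "Real Madrid", "Real Madrid"],
    "type": ["Starting XI", "Starting XI", "Half Start", "Half Start", "Half Start",
             "Half Start", "Pass", "Pass", "Pass", "Pass"],
})

# How often each event type occurs overall.
type_counts = events_sample["type"].value_counts()

# The same tally split by team: one row per team, one column per event type.
per_team = events_sample.groupby(["team", "type"]).size().unstack(fill_value=0)

print(type_counts)
print(per_team)
```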
  • This completes our section on how to get access to open event data for a particular football match. From here on, we keep only those events on which we want to perform further analysis and draw conclusions. Next, we will learn how to visualize a football pitch using mplsoccer.

Draw a Football Pitch

  • If you do not want to recreate a football pitch manually in Python (which would be rather tedious), you can simply use the mplsoccer module. To my knowledge, it provides the best functionality for drawing a football pitch. The package is maintained by Anmol Durgapal and Andrew Rowlinson.

  • Keep in mind that mplsoccer can do far more advanced visualization than drawing a football pitch. We will encounter some of it as we move forward. For now, let us focus on visualizing a pitch in the simplest way possible. We need to pip install the package first:

In [12]:
pip install mplsoccer
Requirement already satisfied: mplsoccer in c:\users\indra\anaconda3\lib\site-packages (0.0.23)
Note: you may need to restart the kernel to use updated packages.
  • Note that mplsoccer requires Python 3.6+. Next we need to import matplotlib and the Pitch class.
In [14]:
import matplotlib.pyplot as plt
from mplsoccer.pitch import Pitch
  • Let us try to draw the simplest football pitch that satisfies our visualization needs.
In [15]:
pitch = Pitch(pitch_color = 'grass', line_color = 'white', stripe = True, constrained_layout = True,
        tight_layout = False, goal_type = 'box', label = True,  axis = True, tick = True)
fig, ax = pitch.draw()
plt.show()
  • Let us try to understand what is happening here. Personally, I like setting the pitch_color argument to 'grass', which gives the impression of a real-life football pitch. Any other color can be set, for example 'black' or any color given by its hex code. Dropping the stripe argument removes the darker stripes that appear on the pitch. The line_color argument is self-explanatory and can be changed as needed. By default, the axis, the labels and the ticks representing the scales are switched off. The user can turn them on by setting the label, axis and tick arguments to True, as in the pitch above. Let us draw a different pitch with its color changed and stripes removed.
In [16]:
pitch = Pitch(pitch_color='black', line_color = 'white', constrained_layout = True,
        tight_layout = False, goal_type = 'box', label = True,  axis = True, tick = True)
fig, ax = pitch.draw()
plt.show()
  • Now let us focus on the axis range for a moment. By default, Pitch() sets the pitch type to 'statsbomb', where the y-axis is inverted and runs from 80 to 0, while the x-axis runs from 0 to 120. We will mostly be working with StatsBomb data, so this orientation will not be much of a concern. Nevertheless, it is very useful to keep in mind in case we deal with football data from other sources.

  • To be precise, mplsoccer provides eight different pitch types: 'statsbomb', 'opta', 'tracab', 'skillcorner', 'wyscout', 'metricasports', 'uefa', and 'custom'. The type is set with the pitch_type argument of Pitch(). Let us check the orientation of the uefa pitch type:

In [17]:
pitch = Pitch(pitch_color='grass', stripe = True, pitch_type = 'uefa', line_color = 'white', constrained_layout = True,
        tight_layout = False, goal_type = 'box', label = True,  axis = True, tick = True)
fig, ax = pitch.draw()
plt.show()
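  • Because pitch types differ in axis range and orientation, coordinates from one provider have to be converted before they are drawn on another pitch type. Below is a rough sketch of a plain linear rescale from the StatsBomb frame (120 x 80, inverted y-axis) to a metre-based frame. The function name statsbomb_to_uefa and the target dimensions of 105 x 68 with the origin at the bottom-left corner are assumptions of this sketch, not something taken from mplsoccer.

```python
import numpy as np

def statsbomb_to_uefa(x, y, length=105.0, width=68.0):
    """Linearly rescale StatsBomb coordinates to a length x width pitch.

    Assumes the source frame is 120 x 80 with an inverted y-axis and that
    the target frame has its origin at the bottom-left corner.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_new = x / 120.0 * length
    y_new = (80.0 - y) / 80.0 * width  # flip the y-axis, then rescale
    return x_new, y_new

# Corner, centre spot and opposite corner of the StatsBomb pitch.
x_u, y_u = statsbomb_to_uefa([0, 60, 120], [0, 40, 80])
print(x_u, y_u)
```

A linear rescale ignores that pitch markings such as the penalty area sit at slightly different relative positions in the two systems, so it should be read as an approximation.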
  • The reader might have noticed that, by default, the pitch is drawn horizontally. To draw it vertically, pass the additional argument orientation and set it to 'vertical'.
In [18]:
pitch = Pitch(orientation = 'vertical', pitch_color = 'grass', line_color = 'white', stripe = True, constrained_layout = True,
        tight_layout = False, goal_type = 'box')
fig, ax = pitch.draw()
plt.show()
  • The user can also draw only half of the pitch by setting the view argument to 'half'.
In [19]:
pitch = Pitch(view = 'half', pitch_color = 'grass', line_color = 'white', stripe = True, constrained_layout = True,
        tight_layout = False, goal_type = 'box')
fig, ax = pitch.draw()
plt.show()
  • These are the most basic concepts for drawing and visualizing a football pitch using mplsoccer. The pitches can be customized further to meet the user's visualization needs; see the mplsoccer documentation to learn more. In the next section, we will learn how to visualize a pass network for a particular team from a match and analyze it with the NetworkX Python package. This package lets us apply basic concepts from the complex network analysis literature to the network and deduce some interesting properties from it.

Visualize a Pass Network

  • We will use the NetworkX Python package for the analysis.
  • You can either watch the soccer-analysis tutorial by McKay Johns or read my blog to follow this part in more detail.
  • Following the approach given by mplsoccer, we will take the average locations of the starting 11 players on the field to construct the pass network, and we will also count the number of passes made by each of these players.
In [26]:
lineup_Real = pd.DataFrame.from_dict(dict_Real)
lineup_Real
Out[26]:
player position jersey_number
0 {'id': 5597, 'name': 'Keylor Navas Gamboa'} {'id': 1, 'name': 'Goalkeeper'} 1
1 {'id': 5721, 'name': 'Daniel Carvajal Ramos'} {'id': 2, 'name': 'Right Back'} 2
2 {'id': 5485, 'name': 'Raphaël Varane'} {'id': 3, 'name': 'Right Center Back'} 5
3 {'id': 5201, 'name': 'Sergio Ramos García'} {'id': 5, 'name': 'Left Center Back'} 4
4 {'id': 5552, 'name': 'Marcelo Vieira da Silva ... {'id': 6, 'name': 'Left Back'} 12
5 {'id': 5539, 'name': 'Carlos Henrique Casimiro'} {'id': 10, 'name': 'Center Defensive Midfield'} 14
6 {'id': 5463, 'name': 'Luka Modrić'} {'id': 13, 'name': 'Right Center Midfield'} 10
7 {'id': 5574, 'name': 'Toni Kroos'} {'id': 15, 'name': 'Left Center Midfield'} 8
8 {'id': 4926, 'name': 'Francisco Román Alarcón ... {'id': 19, 'name': 'Center Attacking Midfield'} 22
9 {'id': 19677, 'name': 'Karim Benzema'} {'id': 22, 'name': 'Right Center Forward'} 9
10 {'id': 5207, 'name': 'Cristiano Ronaldo dos Sa... {'id': 24, 'name': 'Left Center Forward'} 7
In [27]:
lineup_Liv = pd.DataFrame.from_dict(dict_Liv)
lineup_Liv
Out[27]:
player position jersey_number
0 {'id': 3630, 'name': 'Loris Karius'} {'id': 1, 'name': 'Goalkeeper'} 1
1 {'id': 3664, 'name': 'Trent Alexander-Arnold'} {'id': 2, 'name': 'Right Back'} 66
2 {'id': 3471, 'name': 'Dejan Lovren'} {'id': 3, 'name': 'Right Center Back'} 6
3 {'id': 3669, 'name': 'Virgil van Dijk'} {'id': 5, 'name': 'Left Center Back'} 4
4 {'id': 3655, 'name': 'Andrew Robertson'} {'id': 6, 'name': 'Left Back'} 26
5 {'id': 3532, 'name': 'Jordan Brian Henderson'} {'id': 10, 'name': 'Center Defensive Midfield'} 14
6 {'id': 3567, 'name': 'Georginio Wijnaldum'} {'id': 13, 'name': 'Right Center Midfield'} 5
7 {'id': 3473, 'name': 'James Philip Milner'} {'id': 15, 'name': 'Left Center Midfield'} 7
8 {'id': 3531, 'name': 'Mohamed Salah'} {'id': 17, 'name': 'Right Wing'} 11
9 {'id': 3629, 'name': 'Sadio Mané'} {'id': 21, 'name': 'Left Wing'} 19
10 {'id': 3535, 'name': 'Roberto Firmino Barbosa ... {'id': 23, 'name': 'Center Forward'} 9
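  • The lineup tables above nest each player's name inside a dictionary, so a lookup from name to jersey number has to unpack it first. The following is a minimal sketch of one way to build such a lookup. lineup_sample and players_sample are hypothetical names, the three rows are copied from the Liverpool table above, and storing the numbers as strings is a choice of this sketch.

```python
import pandas as pd

# Hypothetical stand-in for a lineup table: three rows copied from above.
lineup_sample = pd.DataFrame({
    "player": [
        {"id": 3630, "name": "Loris Karius"},
        {"id": 3669, "name": "Virgil van Dijk"},
        {"id": 3531, "name": "Mohamed Salah"},
    ],
    "jersey_number": [1, 4, 11],
})

# Pull the name out of each nested dictionary.
names = lineup_sample["player"].apply(lambda p: p["name"])

# Store jersey numbers as strings so they can be used directly as labels.
numbers = lineup_sample["jersey_number"].astype(str)

# Map player name -> jersey number.
players_sample = dict(zip(names, numbers))
print(players_sample)
```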
  • So, we have collected the names and the jersey numbers of the starting 11 players of both teams in separate dictionaries named players_Real and players_Liv. These will come in handy later!

  • Now from the events dataset we will extract out the relevant columns for our pass network analysis purposes.

In [30]:
events_pn = events[['minute', 'second', 'team', 'type', 'location', 'pass_end_location', 'pass_outcome', 'player']]
  • The first 10 rows of the events_pn dataframe:
In [31]:
events_pn.head(10)
Out[31]:
minute second team type location pass_end_location pass_outcome player
0 0 0 Real Madrid Starting XI NaN NaN NaN NaN
1 0 0 Liverpool Starting XI NaN NaN NaN NaN
2 0 0 Real Madrid Half Start NaN NaN NaN NaN
3 0 0 Liverpool Half Start NaN NaN NaN NaN
4 45 0 Liverpool Half Start NaN NaN NaN NaN
5 45 0 Real Madrid Half Start NaN NaN NaN NaN
6 0 0 Liverpool Pass [60.0, 40.0] [32.1, 41.2] NaN James Philip Milner
7 0 3 Liverpool Pass [35.0, 40.8] [92.7, 22.7] Incomplete Dejan Lovren
8 0 8 Real Madrid Pass [27.4, 60.2] [36.1, 71.6] NaN Raphaël Varane
9 0 10 Real Madrid Pass [35.3, 75.4] [22.4, 76.6] NaN Luka Modrić
  • As we are only interested in the pass network generation, we will filter the datasets by keeping those rows where type is set to Pass.
In [36]:
events_pn_Real = events_Real[events_Real['type'] == 'Pass']
events_pn_Liv = events_Liv[events_Liv['type'] == 'Pass']
  • Again view the first 10 rows of the filtered datasets:
In [37]:
events_pn_Real.head(10)
Out[37]:
minute second team type location pass_end_location pass_outcome player
8 0 8 Real Madrid Pass [27.4, 60.2] [36.1, 71.6] NaN Raphaël Varane
9 0 10 Real Madrid Pass [35.3, 75.4] [22.4, 76.6] NaN Luka Modrić
10 0 11 Real Madrid Pass [22.3, 76.6] [33.4, 68.0] NaN Daniel Carvajal Ramos
11 0 15 Real Madrid Pass [36.2, 75.3] [43.6, 62.0] Incomplete Carlos Henrique Casimiro
16 0 25 Real Madrid Pass [14.7, 23.2] [56.7, 6.2] Incomplete Sergio Ramos García
17 0 40 Real Madrid Pass [57.5, 4.6] [49.2, 15.6] NaN Marcelo Vieira da Silva Júnior
18 0 43 Real Madrid Pass [48.8, 18.4] [49.8, 12.5] NaN Carlos Henrique Casimiro
19 0 46 Real Madrid Pass [48.8, 13.9] [36.1, 56.3] NaN Toni Kroos
20 0 52 Real Madrid Pass [41.3, 54.8] [34.4, 40.2] NaN Raphaël Varane
21 0 55 Real Madrid Pass [39.1, 36.5] [65.4, 13.1] NaN Sergio Ramos García
In [38]:
events_pn_Liv.head(10)
Out[38]:
minute second team type location pass_end_location pass_outcome player
6 0 0 Liverpool Pass [60.0, 40.0] [32.1, 41.2] NaN James Philip Milner
7 0 3 Liverpool Pass [35.0, 40.8] [92.7, 22.7] Incomplete Dejan Lovren
12 0 16 Liverpool Pass [76.5, 18.1] [84.8, 9.5] NaN Jordan Brian Henderson
13 0 18 Liverpool Pass [84.4, 10.0] [92.5, 19.1] NaN Sadio Mané
14 0 19 Liverpool Pass [91.6, 21.3] [90.6, 50.7] NaN Roberto Firmino Barbosa de Oliveira
15 0 22 Liverpool Pass [92.2, 50.9] [109.7, 46.4] Incomplete Mohamed Salah
25 1 7 Liverpool Pass [42.0, 75.9] [115.6, 59.3] Incomplete Trent Alexander-Arnold
37 2 0 Liverpool Pass [9.9, 39.1] [28.1, 4.2] NaN Virgil van Dijk
38 2 3 Liverpool Pass [43.2, 2.8] [50.1, 4.8] Incomplete Andrew Robertson
39 2 7 Liverpool Pass [53.2, 0.1] [50.0, 4.0] NaN Andrew Robertson
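  • The filtered pass events still have to be aggregated into average player locations and pass counts between pairs of players before a network can be drawn. The sketch below shows one possible way to do this and is not necessarily the method used in this notebook. It assumes a pass_recipient column is kept (it appears in the events columns but not in events_pn), it treats a missing pass_outcome as a completed pass, and the frame passes with players "A", "B" and "C" is entirely hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical pass events; names and coordinates are illustrative only.
passes = pd.DataFrame({
    "player": ["A", "A", "B", "B", "C"],
    "pass_recipient": ["B", "B", "A", "C", "A"],
    "location": [[10.0, 20.0], [30.0, 40.0], [50.0, 60.0], [70.0, 20.0], [90.0, 40.0]],
    "pass_outcome": [None, None, None, "Incomplete", None],
})

# Keep completed passes only (a missing outcome marks a completed pass).
completed = passes[passes["pass_outcome"].isna()].copy()

# Split the [x, y] lists into two numeric columns.
xy = np.array(completed["location"].tolist(), dtype=float)
completed["x"] = xy[:, 0]
completed["y"] = xy[:, 1]

# Nodes: average location and number of completed passes per passer.
av_loc = completed.groupby("player").agg(
    pass_maker_x=("x", "mean"),
    pass_maker_y=("y", "mean"),
    count=("x", "count"),
)

# Edges: number of completed passes per (passer, receiver) pair.
pair_counts = (
    completed.groupby(["player", "pass_recipient"])
    .size()
    .reset_index(name="number_of_passes")
    .rename(columns={"player": "pass_maker", "pass_recipient": "pass_receiver"})
)
print(av_loc)
print(pair_counts)
```

Merging av_loc onto pair_counts once for the passer and once for the receiver would give a table with the same kind of columns as the pass tables shown in this section.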
  • We will replace the player names with their jersey numbers and create another pair of new datasets:
In [79]:
pass_Real_new = pass_Real.replace({"pass_maker": players_Real, "pass_receiver": players_Real})
In [80]:
pass_Real_new
Out[80]:
index pass_maker pass_receiver number_of_passes pass_maker_x pass_maker_y count pass_receiver_x pass_receiver_y number_of_passes_received
0 0 14 2 1 60.845455 31.836364 11 64.341667 73.875 24
1 6 7 2 3 81.580000 29.160000 10 64.341667 73.875 24
2 21 22 2 2 62.323529 27.082353 17 64.341667 73.875 24
3 29 9 2 2 65.081818 27.936364 11 64.341667 73.875 24
4 39 10 2 10 60.604762 55.028571 21 64.341667 73.875 24
... ... ... ... ... ... ... ... ... ... ...
73 16 2 1 1 64.341667 73.875000 24 10.870000 41.810 10
74 30 9 1 1 65.081818 27.936364 11 10.870000 41.810 10
75 57 5 1 2 37.436364 58.354545 22 10.870000 41.810 10
76 64 4 1 1 41.282353 24.514706 34 10.870000 41.810 10
77 74 8 1 1 51.190000 24.275000 40 10.870000 41.810 10

78 rows × 10 columns

In [81]:
pass_Liv_new = pass_Liv.replace({"pass_maker": players_Liv, "pass_receiver": players_Liv})
In [82]:
pass_Liv_new
Out[82]:
index pass_maker pass_receiver number_of_passes pass_maker_x pass_maker_y count pass_receiver_x pass_receiver_y number_of_passes_received
0 12 5 26 4 76.390909 28.518182 11 59.815385 6.830769 13
1 18 7 26 1 72.353333 36.153333 15 59.815385 6.830769 13
2 28 14 26 1 61.035294 37.152941 17 59.815385 6.830769 13
3 36 1 26 1 12.914286 40.385714 7 59.815385 6.830769 13
4 54 66 26 1 64.666667 72.550000 12 59.815385 6.830769 13
... ... ... ... ... ... ... ... ... ... ...
59 55 66 6 1 64.666667 72.550000 12 41.690909 60.172727 11
60 61 4 6 3 43.366667 25.433333 9 41.690909 60.172727 11
61 25 7 19 2 72.353333 36.153333 15 86.275000 22.075000 4
62 33 14 19 1 61.035294 37.152941 17 86.275000 22.075000 4
63 43 11 19 1 77.550000 64.710000 10 86.275000 22.075000 4

64 rows × 10 columns

  • Now let us visualize the pass networks for both the teams.
In [83]:
pitch = Pitch(pitch_color='grass', goal_type = 'box', line_color='white', stripe = True, 
              constrained_layout=True, tight_layout=False)
fig, ax = pitch.draw()
arrows = pitch.arrows(pass_Real.pass_maker_x, pass_Real.pass_maker_y,
                         pass_Real.pass_receiver_x, pass_Real.pass_receiver_y, lw = 5,
                         color = 'black', zorder = 1, ax=ax)
nodes = pitch.scatter(av_loc_Real.pass_maker_x, av_loc_Real.pass_maker_y,
                           s=350, color = 'white', edgecolors='black', linewidth=1, alpha = 1, ax = ax)
                          
for index, row in av_loc_Real.iterrows():
    pitch.annotate(players_Real[row.name], xy=(row.pass_maker_x, row.pass_maker_y),
                   c ='black', va = 'center', ha = 'center', size = 10, ax = ax)
plt.title("Pass network for Real Madrid against Liverpool", size = 20)                   
plt.show()
In [84]:
pitch = Pitch(pitch_color='grass', goal_type = 'box', stripe = True, 
              line_color='white', constrained_layout=True, tight_layout=False)
fig, ax = pitch.draw()
arrows = pitch.arrows(120 - pass_Liv.pass_maker_x, pass_Liv.pass_maker_y,
                         120 - pass_Liv.pass_receiver_x, pass_Liv.pass_receiver_y, lw = 5,
                         color = 'black', zorder = 1, ax = ax)
nodes = pitch.scatter(120 - av_loc_Liv.pass_maker_x, av_loc_Liv.pass_maker_y,
                           s=350, color = 'red', edgecolors = 'black', linewidth=1, alpha = 1, ax = ax)
                           
for index, row in av_loc_Liv.iterrows():
    pitch.annotate(players_Liv[row.name], xy=(120 - row.pass_maker_x, row.pass_maker_y), 
                   c ='black', va = 'center', ha = 'center', size = 10, ax = ax)
plt.title("Pass network for Liverpool against Real Madrid", size = 20)
plt.show()
  • In case of Liverpool's pass network visualization, we subtract the x coordinates from 120 just to reverse the x-axis.
  • Now that we have visualized the pass networks of both teams, we can start analyzing them using metrics from the literature on complex network analysis.

  • Note that both of our networks are directed weighted graphs, with the number of passes as the weight of a directed edge.

  • Let us first build a graph isomorphic to the one we just visualized for Real Madrid, but this time using the networkx package. First we select the relevant columns from the pass_Real_new dataset:

In [85]:
pass_Real_new = pass_Real_new[['pass_maker', 'pass_receiver', 'number_of_passes']]
pass_Real_new
Out[85]:
pass_maker pass_receiver number_of_passes
0 14 2 1
1 7 2 3
2 22 2 2
3 9 2 2
4 10 2 10
... ... ... ...
73 2 1 1
74 9 1 1
75 5 1 2
76 4 1 1
77 8 1 1

78 rows × 3 columns

  • We will next convert pass_Real_new to a list of tuples, where each row is converted to a tuple. This is required for drawing a networkx graph.
In [86]:
L_Real = pass_Real_new.apply(tuple, axis=1).tolist()
print(L_Real)
[('14', '2', 1), ('7', '2', 3), ('22', '2', 2), ('9', '2', 2), ('10', '2', 10), ('12', '2', 2), ('5', '2', 3), ('4', '2', 3), ('8', '2', 1), ('14', '10', 1), ('7', '10', 1), ('2', '10', 7), ('22', '10', 1), ('12', '10', 1), ('5', '10', 5), ('4', '10', 2), ('8', '10', 5), ('14', '12', 1), ('7', '12', 4), ('22', '12', 2), ('1', '12', 2), ('10', '12', 1), ('4', '12', 9), ('8', '12', 4), ('14', '5', 1), ('2', '5', 5), ('1', '5', 2), ('10', '5', 3), ('12', '5', 2), ('4', '5', 5), ('8', '5', 4), ('14', '4', 1), ('7', '4', 1), ('22', '4', 5), ('9', '4', 1), ('1', '4', 4), ('10', '4', 1), ('12', '4', 2), ('5', '4', 6), ('8', '4', 10), ('14', '8', 6), ('2', '8', 1), ('22', '8', 4), ('9', '8', 4), ('1', '8', 1), ('10', '8', 4), ('12', '8', 5), ('5', '8', 4), ('4', '8', 9), ('7', '9', 1), ('2', '9', 1), ('22', '9', 1), ('1', '9', 1), ('10', '9', 1), ('12', '9', 3), ('5', '9', 1), ('8', '9', 2), ('2', '14', 2), ('9', '14', 2), ('10', '14', 1), ('12', '14', 2), ('5', '14', 1), ('8', '14', 2), ('2', '7', 2), ('22', '7', 2), ('9', '7', 1), ('12', '7', 2), ('4', '7', 1), ('8', '7', 2), ('2', '22', 3), ('12', '22', 4), ('4', '22', 4), ('8', '22', 8), ('2', '1', 1), ('9', '1', 1), ('5', '1', 2), ('4', '1', 1), ('8', '1', 1)]
  • Now, we can draw the directed weighted graph:
In [87]:
import networkx as nx

G_Real = nx.DiGraph()

# Each tuple is (pass_maker, pass_receiver, number_of_passes)
for maker, receiver, n_passes in L_Real:
    G_Real.add_edge(maker, receiver, weight = n_passes)

edges_Real = G_Real.edges()
weights_Real = [G_Real[u][v]['weight'] for u, v in edges_Real]

nx.draw(G_Real, node_size=800, with_labels=True, node_color='white', width = weights_Real)
plt.gca().collections[0].set_edgecolor('black') # sets the edge color of the nodes to black
plt.title("Pass network for Real Madrid vs Liverpool", size = 20)
plt.show()
  • For Liverpool too, let us first reduce the pass_Liv_new dataset to the relevant columns and then draw the isomorphic weighted directed graph:
In [88]:
pass_Liv_new = pass_Liv_new[['pass_maker', 'pass_receiver', 'number_of_passes']]
In [89]:
pass_Liv_new
Out[89]:
pass_maker pass_receiver number_of_passes
0 5 26 4
1 7 26 1
2 14 26 1
3 1 26 1
4 66 26 1
... ... ... ...
59 66 6 1
60 4 6 3
61 7 19 2
62 14 19 1
63 11 19 1

64 rows × 3 columns

In [90]:
L_Liv = pass_Liv_new.apply(tuple, axis=1).tolist()
G_Liv = nx.DiGraph()

# Each tuple is (pass_maker, pass_receiver, number_of_passes)
for maker, receiver, n_passes in L_Liv:
    G_Liv.add_edge(maker, receiver, weight = n_passes)

edges_Liv = G_Liv.edges()
weights_Liv = [G_Liv[u][v]['weight'] for u, v in edges_Liv]

nx.draw(G_Liv, node_size = 800, with_labels = True, node_color = 'red', width = weights_Liv)
plt.gca().collections[0].set_edgecolor('black') # sets the edge color of the nodes to black
plt.show()
  • Let us discuss some of the important networkx classes and functions that we have used for drawing graphs:

    • DiGraph() is the base class for directed graphs,
    • add_edge() adds an edge between the two nodes given by its first two arguments, and the weight parameter sets the weight of this edge,
    • draw() visualizes a networkx graph, and its parameters are self-explanatory.
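  • As an alternative to calling add_edge() in a loop, a list of (source, target, weight) tuples can be handed to the graph in one call. A small sketch, where edges_sample holds five tuples copied from L_Real and G_sample is a hypothetical name:

```python
import networkx as nx

# A few (pass_maker, pass_receiver, number_of_passes) tuples from L_Real.
edges_sample = [
    ("8", "4", 10), ("4", "8", 9), ("8", "10", 5), ("10", "8", 4), ("4", "10", 2),
]

G_sample = nx.DiGraph()
# The third item of each tuple is stored under the edge attribute 'weight'.
G_sample.add_weighted_edges_from(edges_sample)

# Read the weights back in the same order as the edges.
weights_sample = [w for _, _, w in G_sample.edges(data="weight")]
print(G_sample.number_of_nodes(), G_sample.number_of_edges(), weights_sample)
```

Because the graph is directed, the edge from '4' to '10' does not imply an edge from '10' to '4'.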
  • Let us now understand the degree, indegree and outdegree of a node in a directed weighted graph. The indegree of a node is the number of edges directed towards it, i.e., in our case, the number of distinct teammates from whom a player (node) received passes. Similarly, the outdegree is the number of edges directed outwards from the node, i.e., the number of distinct teammates to whom the player passed. Finally, the degree of a node is the total number of edges connected to it (ignoring the directions of the edges), so it is the sum of the node's indegree and outdegree. Note that these counts ignore the edge weights: to obtain the total number of passes received or given, the edges have to be weighted by number_of_passes.

We will use networkx to find the node degrees in the pass network of Real Madrid.

In [91]:
# Prepare a dictionary with jersey numbers as the node ids, 
# i.e, the dictionary keys and degrees as the dictionary values
deg_Real = dict(nx.degree(G_Real)) 
# convert a dictionary to a pandas dataframe
degree_Real = pd.DataFrame.from_dict(list(deg_Real.items())) 
degree_Real.rename(columns = {0:'jersey_number', 1: 'node_degree'}, inplace = True)
In [92]:
degree_Real
Out[92]:
jersey_number node_degree
0 14 12
1 2 17
2 7 11
3 22 11
4 9 14
5 10 15
6 12 16
7 5 14
8 4 17
9 8 19
10 1 10
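  • The degree values in the table above count edges, so each teammate contributes at most one to a player's indegree and one to the outdegree. Passing weight='weight' makes networkx sum the edge weights instead, which gives pass totals. A small sketch on a hypothetical three-player graph G_sample built from five tuples of L_Real:

```python
import networkx as nx

# (pass_maker, pass_receiver, number_of_passes) tuples copied from L_Real.
G_sample = nx.DiGraph()
G_sample.add_weighted_edges_from([
    ("8", "4", 10), ("4", "8", 9), ("8", "10", 5), ("10", "8", 4), ("4", "10", 2),
])

# Unweighted: number of incoming / outgoing edges (distinct passing partners).
in_links = G_sample.in_degree("8")
out_links = G_sample.out_degree("8")
total_links = G_sample.degree("8")

# Weighted: number of passes received / made.
passes_received = G_sample.in_degree("8", weight="weight")
passes_made = G_sample.out_degree("8", weight="weight")

print(in_links, out_links, total_links, passes_received, passes_made)
```

In this small graph player '8' has 2 incoming and 2 outgoing links, but receives 13 passes and makes 15.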
  • Out of the 11 starting players for Real Madrid in that game, the player with jersey number 8 (i.e., Toni Kroos) had the highest degree, 19. Ranked second, with degree 17, are the players with jersey numbers 2 and 4, i.e., our favorite Spanish defenders 'Daniel Carvajal Ramos' and 'Sergio Ramos García' respectively. Tremendous! Let us use seaborn to visualize the deg_Real dictionary as a bar plot:
In [93]:
import seaborn as sns

X = list(deg_Real.keys())
Y = list(deg_Real.values())
sns.barplot(x = Y, y = X, palette = "magma")
plt.xticks(range(0, max(Y)+5, 2))
plt.ylabel("Player Jersey number")
plt.xlabel("degree")
plt.title("Player pass degrees for Real Madrid vs Liverpool", size = 16)
plt.show()
  • Let us build the dataframe for Liverpool too:
In [94]:
# Prepare a dictionary with jersey numbers as the node ids, 
# i.e, the dictionary keys and degrees as the dictionary values
deg_Liv = dict(nx.degree(G_Liv)) 
# convert a dictionary to a pandas dataframe
degree_Liv = pd.DataFrame.from_dict(list(deg_Liv.items()))
degree_Liv.rename(columns = {0:'jersey_number', 1: 'node_degree'}, inplace = True)
degree_Liv
Out[94]:
jersey_number node_degree
0 5 12
1 26 11
2 7 17
3 14 17
4 1 7
5 66 13
6 4 12
7 11 11
8 6 12
9 9 10
10 19 6
  • We see that for Liverpool the degree is highest (17) for the players with jersey numbers 14 and 7, i.e., 'Jordan Brian Henderson' and 'James Philip Milner' respectively. We will visualize the deg_Liv dictionary as a bar plot:
In [95]:
X = list(deg_Liv.keys())
Y = list(deg_Liv.values())
sns.barplot(x = Y, y = X, palette = "magma")
plt.xticks(range(0, max(Y)+5, 2))
plt.ylabel("Player Jersey number")
plt.xlabel("degree")
plt.title("Player pass degrees for Liverpool vs Real Madrid", size = 16)
plt.show()
  • We will draw similar bar plots for the indegrees and the outdegrees too:
In [96]:
indeg_Real = dict(G_Real.in_degree()) 
indegree_Real = pd.DataFrame.from_dict(list(indeg_Real.items())) 
indegree_Real.rename(columns = {0:'jersey_number', 1: 'node_indegree'}, inplace = True)
X = list(indeg_Real.keys())
Y = list(indeg_Real.values())
sns.barplot(x = Y, y = X, palette = "hls")
plt.xticks(range(0, max(Y)+5, 2))
plt.ylabel("Player Jersey number")
plt.xlabel("indegree")
plt.title("Player pass indegrees for Real Madrid vs Liverpool", size = 16)
plt.show()
In [97]:
indeg_Liv = dict(G_Liv.in_degree()) 
indegree_Liv = pd.DataFrame.from_dict(list(indeg_Liv.items())) 
indegree_Liv.rename(columns = {0:'jersey_number', 1: 'node_indegree'}, inplace = True)
X = list(indeg_Liv.keys())
Y = list(indeg_Liv.values())
sns.barplot(x = Y, y = X, palette = "hls")
plt.xticks(range(0, max(Y)+5, 2))
plt.ylabel("Player Jersey number")
plt.xlabel("indegree")
plt.title("Player pass indegrees for Liverpool vs Real Madrid", size = 16)
plt.show()
In [98]:
outdeg_Real = dict(G_Real.out_degree()) 
outdegree_Real = pd.DataFrame.from_dict(list(outdeg_Real.items())) 
outdegree_Real.rename(columns = {0:'jersey_number', 1: 'node_outdegree'}, inplace = True)
X = list(outdeg_Real.keys())
Y = list(outdeg_Real.values())
sns.barplot(x = Y, y = X, palette = "hls")
plt.xticks(range(0, max(Y)+5, 2))
plt.ylabel("Player Jersey number")
plt.xlabel("outdegree")
plt.title("Player pass outdegrees for Real Madrid vs Liverpool", size = 16)
plt.show()
In [99]:
outdeg_Liv = dict(G_Liv.out_degree()) 
outdegree_Liv = pd.DataFrame.from_dict(list(outdeg_Liv.items())) 
outdegree_Liv.rename(columns = {0:'jersey_number', 1: 'node_outdegree'}, inplace = True)
X = list(outdeg_Liv.keys())
Y = list(outdeg_Liv.values())
sns.barplot(x = Y, y = X, palette = "hls")
plt.xticks(range(0, max(Y)+5, 2))
plt.ylabel("Player Jersey number")
plt.xlabel("outdegree")
plt.title("Player pass outdegrees for Liverpool vs Real Madrid", size = 16)
plt.show()
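  • The four cells above differ only in the graph and in the kind of degree. A minimal sketch of a helper that builds the same kind of table is given below. The function name degree_table and the toy graph G_toy are assumptions made for illustration and are not part of the original analysis:

```python
import networkx as nx
import pandas as pd

def degree_table(G, kind="degree"):
    """Return node degrees as a DataFrame.

    kind is one of 'degree', 'in_degree' or 'out_degree'
    (the names of the corresponding networkx DiGraph views).
    """
    degrees = dict(getattr(G, kind)())
    column = "node_" + kind.replace("_", "")  # e.g. 'node_indegree'
    return pd.DataFrame(list(degrees.items()), columns=["jersey_number", column])

# Toy directed pass network with hypothetical jersey numbers
G_toy = nx.DiGraph()
G_toy.add_edges_from([("4", "8"), ("8", "7"), ("7", "8"), ("8", "4")])

indegree_toy = degree_table(G_toy, "in_degree")
outdegree_toy = degree_table(G_toy, "out_degree")
print(indegree_toy)
print(outdegree_toy)
```

  • With the graphs from this notebook, degree_table(G_Real, "in_degree") should give the same columns as indegree_Real.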
  • Now, let us generate the adjacency matrices for both the G_Real and G_Liv graphs:
In [100]:
A_Real = nx.adjacency_matrix(G_Real)
A_Liv = nx.adjacency_matrix(G_Liv)
A_Real = A_Real.todense()
A_Liv = A_Liv.todense()
In [101]:
sns.heatmap(A_Real, annot = True, cmap ='gnuplot')
plt.title("Adjacency matrix for Real Madrid's pass network")
plt.show()
In [102]:
sns.heatmap(A_Liv, annot = True, cmap ='gnuplot')
plt.title("Adjacency matrix for Liverpool's pass network")
plt.show()
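  • The heatmaps above label rows and columns by position rather than by jersey number. A minimal sketch, on a toy graph with hypothetical jersey numbers and pass counts, of building a labelled adjacency matrix with nx.to_pandas_adjacency and checking it for self-loops:

```python
import networkx as nx
import numpy as np

# Toy weighted pass network: (pass_maker, pass_receiver, number_of_passes)
G_toy = nx.DiGraph()
G_toy.add_weighted_edges_from([("1", "4", 3), ("4", "8", 5), ("8", "4", 2)])

# Rows are pass makers, columns are pass receivers, both labelled by node id
adj = nx.to_pandas_adjacency(G_toy, weight="weight")
print(adj)

# Two equivalent checks that no player passes to themselves
has_self_loops = nx.number_of_selfloops(G_toy) > 0
diagonal_is_zero = bool(np.all(np.diag(adj.values) == 0))
print(has_self_loops, diagonal_is_zero)
```

  • Passing adj to sns.heatmap(adj, annot = True) would then show the jersey numbers on both axes.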
  • If we look at the diagonals of the adjacency matrices, we notice that all the values are 0. This shows that there are no self-loops at any node, as expected, since a player cannot pass to themselves.
  • The next step is to calculate the degree correlation coefficient of a graph. More specifically, we will calculate Pearson's degree correlation coefficient. A positive value of the metric shows an overall positive relationship between the degrees (number of successful passes) of two adjacent nodes (players), whereas a negative value shows an overall negative relationship. If it is 0, there is no relationship. The metric lies in [-1, 1], with -1 indicating a perfect negative relationship and 1 a perfect positive relationship.
In [103]:
r_Real = nx.degree_pearson_correlation_coefficient(G_Real, weight = 'weight')
r_Liv = nx.degree_pearson_correlation_coefficient(G_Liv, weight = 'weight')
print(r_Real, r_Liv)
-0.17983836432860184 -0.24123721966990644
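  • To get a feel for the sign of this coefficient, here is a minimal sketch on a star graph, which is not taken from the match data. In a star, the high-degree hub is connected only to degree-1 leaves, so the degrees at the two ends of every edge are perfectly anti-correlated:

```python
import networkx as nx

# Star graph: node 0 is the hub, nodes 1..5 are leaves attached only to the hub
star = nx.star_graph(5)

# Every edge joins a degree-5 node to a degree-1 node
r_star = nx.degree_pearson_correlation_coefficient(star)
print(r_star)
```

  • The negative values found for both teams point, much more weakly, in the same direction: players with many passing connections tend to be linked to players with fewer.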
  • Now we work on a metric that focuses on the geodesic distance between two player nodes in a graph. One way to turn pass counts into distances is to take the reciprocal of the edge weight (the number_of_passes column), so that two players who exchanged many passes end up close to each other. Let us create a new graph for Real Madrid:
In [104]:
# Take an explicit copy so that adding a column does not trigger SettingWithCopyWarning
pass_Real_mod = pass_Real_new[['pass_maker', 'pass_receiver']].copy()
pass_Real_mod['1/nop'] = 1/pass_Real_new['number_of_passes']
pass_Real_mod.head(5)
Out[104]:
pass_maker pass_receiver 1/nop
0 14 2 1.000000
1 7 2 0.333333
2 22 2 0.500000
3 9 2 0.500000
4 10 2 0.100000
In [105]:
L_Real_mod = pass_Real_mod.apply(tuple, axis=1).tolist()

G_Real_mod = nx.DiGraph()

for i in range(len(L_Real_mod)):
    G_Real_mod.add_edge(L_Real_mod[i][0], L_Real_mod[i][1], weight = L_Real_mod[i][2])

edges_Real_mod = G_Real_mod.edges()
weights_Real_mod = [G_Real_mod[u][v]['weight'] for u, v in edges_Real_mod]

nx.draw(G_Real_mod, node_size=800, with_labels=True, node_color='white', width = weights_Real_mod)
plt.gca().collections[0].set_edgecolor('black')
plt.title("Modified pass network for Real Madrid vs Liverpool", size = 20)

plt.show()
  • We will perform the same operations to create a modified graph for Liverpool too:
In [106]:
# Take an explicit copy so that adding a column does not trigger SettingWithCopyWarning
pass_Liv_mod = pass_Liv_new[['pass_maker', 'pass_receiver']].copy()
pass_Liv_mod['1/nop'] = 1/pass_Liv_new['number_of_passes']
pass_Liv_mod.head(5)
Out[106]:
pass_maker pass_receiver 1/nop
0 5 26 0.25
1 7 26 1.00
2 14 26 1.00
3 1 26 1.00
4 66 26 1.00
In [107]:
L_Liv_mod = pass_Liv_mod.apply(tuple, axis=1).tolist()

G_Liv_mod = nx.DiGraph()

for i in range(len(L_Liv_mod)):
    G_Liv_mod.add_edge(L_Liv_mod[i][0], L_Liv_mod[i][1], weight = L_Liv_mod[i][2])

edges_Liv_mod = G_Liv_mod.edges()
weights_Liv_mod = [G_Liv_mod[u][v]['weight'] for u, v in edges_Liv_mod]

nx.draw(G_Liv_mod, node_size=800, with_labels=True, node_color='red', width = weights_Liv_mod)
plt.gca().collections[0].set_edgecolor('black')
plt.title("Modified pass network for Liverpool vs Real Madrid", size = 20)

plt.show()
  • Now, using these modified graphs, we can calculate the all-pairs shortest paths between the nodes (players) for both teams. Let us first compute them for Real Madrid:
In [108]:
dis_Real = nx.shortest_path(G_Real_mod, weight = 'weight')
print(dis_Real)
{'14': {'14': ['14'], '2': ['14', '8', '10', '2'], '10': ['14', '8', '10'], '12': ['14', '8', '4', '12'], '5': ['14', '8', '5'], '4': ['14', '8', '4'], '8': ['14', '8'], '9': ['14', '8', '9'], '7': ['14', '8', '7'], '22': ['14', '8', '22'], '1': ['14', '8', '5', '1']}, '2': {'2': ['2'], '10': ['2', '10'], '5': ['2', '5'], '8': ['2', '10', '8'], '9': ['2', '5', '4', '12', '9'], '14': ['2', '14'], '7': ['2', '7'], '22': ['2', '22'], '1': ['2', '5', '1'], '12': ['2', '5', '4', '12'], '4': ['2', '5', '4']}, '7': {'7': ['7'], '2': ['7', '2'], '10': ['7', '2', '10'], '12': ['7', '12'], '4': ['7', '12', '8', '4'], '9': ['7', '12', '9'], '5': ['7', '2', '5'], '8': ['7', '12', '8'], '14': ['7', '12', '14'], '22': ['7', '12', '22'], '1': ['7', '2', '5', '1']}, '22': {'22': ['22'], '2': ['22', '2'], '10': ['22', '8', '10'], '12': ['22', '4', '12'], '4': ['22', '4'], '8': ['22', '8'], '9': ['22', '4', '12', '9'], '7': ['22', '7'], '5': ['22', '4', '5'], '1': ['22', '4', '5', '1'], '14': ['22', '8', '14']}, '9': {'9': ['9'], '2': ['9', '2'], '4': ['9', '8', '4'], '8': ['9', '8'], '14': ['9', '14'], '7': ['9', '8', '7'], '1': ['9', '1'], '10': ['9', '8', '10'], '12': ['9', '8', '4', '12'], '5': ['9', '8', '5'], '22': ['9', '8', '22']}, '10': {'10': ['10'], '2': ['10', '2'], '12': ['10', '8', '4', '12'], '5': ['10', '2', '5'], '4': ['10', '8', '4'], '8': ['10', '8'], '9': ['10', '8', '9'], '14': ['10', '2', '14'], '7': ['10', '2', '7'], '22': ['10', '8', '22'], '1': ['10', '2', '5', '1']}, '12': {'12': ['12'], '2': ['12', '2'], '10': ['12', '8', '10'], '5': ['12', '8', '5'], '4': ['12', '8', '4'], '8': ['12', '8'], '9': ['12', '9'], '14': ['12', '14'], '7': ['12', '7'], '22': ['12', '22'], '1': ['12', '8', '5', '1']}, '5': {'5': ['5'], '2': ['5', '10', '2'], '10': ['5', '10'], '4': ['5', '4'], '8': ['5', '8'], '9': ['5', '4', '12', '9'], '14': ['5', '8', '14'], '1': ['5', '1'], '12': ['5', '4', '12'], '7': ['5', '8', '7'], '22': ['5', '8', '22']}, '4': {'4': ['4'], '2': ['4', 
'2'], '10': ['4', '8', '10'], '12': ['4', '12'], '5': ['4', '5'], '8': ['4', '8'], '7': ['4', '12', '7'], '22': ['4', '8', '22'], '1': ['4', '5', '1'], '9': ['4', '12', '9'], '14': ['4', '12', '14']}, '8': {'8': ['8'], '2': ['8', '10', '2'], '10': ['8', '10'], '12': ['8', '4', '12'], '5': ['8', '5'], '4': ['8', '4'], '9': ['8', '9'], '14': ['8', '14'], '7': ['8', '7'], '22': ['8', '22'], '1': ['8', '5', '1']}, '1': {'1': ['1'], '12': ['1', '4', '12'], '5': ['1', '4', '5'], '4': ['1', '4'], '8': ['1', '4', '8'], '9': ['1', '4', '12', '9'], '2': ['1', '4', '2'], '10': ['1', '4', '8', '10'], '7': ['1', '4', '12', '7'], '22': ['1', '4', '8', '22'], '14': ['1', '4', '12', '14']}}
  • Suppose we want to calculate the shortest path from 'Keylor Navas Gamboa' (jersey number 1) to 'Cristiano Ronaldo dos Santos Aveiro' (jersey number 7). We will type the following:
In [109]:
print(dis_Real['1']['7'])
['1', '4', '12', '7']
  • So, we see that the shortest weighted route for the ball from 'Keylor Navas Gamboa' (jersey: 1) to 'Cristiano Ronaldo dos Santos Aveiro' (jersey: 7) ran first through 'Sergio Ramos García' (jersey: 4) and then through 'Marcelo Vieira da Silva Júnior' (jersey: 12), who ultimately passed to 'Cristiano Ronaldo dos Santos Aveiro'. This seems like a good post-match analysis tool. I got this idea after discussing with Sarath Babu.
  • Let us do the same analysis for Liverpool:
In [110]:
dis_Liv = nx.shortest_path(G_Liv_mod, weight = 'weight')
print(dis_Liv)
{'5': {'5': ['5'], '26': ['5', '26'], '7': ['5', '26', '7'], '14': ['5', '14'], '4': ['5', '4'], '11': ['5', '11'], '66': ['5', '26', '7', '66'], '9': ['5', '26', '9'], '1': ['5', '14', '1'], '6': ['5', '14', '6'], '19': ['5', '26', '7', '19']}, '26': {'26': ['26'], '5': ['26', '5'], '7': ['26', '7'], '14': ['26', '14'], '9': ['26', '9'], '4': ['26', '4'], '11': ['26', '9', '11'], '66': ['26', '7', '66'], '1': ['26', '14', '1'], '6': ['26', '14', '6'], '19': ['26', '7', '19']}, '7': {'7': ['7'], '26': ['7', '66', '5', '26'], '5': ['7', '66', '5'], '14': ['7', '14'], '9': ['7', '66', '9'], '4': ['7', '4'], '1': ['7', '1'], '11': ['7', '66', '11'], '66': ['7', '66'], '6': ['7', '14', '6'], '19': ['7', '19']}, '14': {'14': ['14'], '26': ['14', '5', '26'], '5': ['14', '5'], '7': ['14', '7'], '4': ['14', '4'], '1': ['14', '1'], '66': ['14', '7', '66'], '6': ['14', '6'], '19': ['14', '7', '19'], '11': ['14', '7', '66', '11'], '9': ['14', '5', '26', '9']}, '1': {'1': ['1'], '26': ['1', '26'], '14': ['1', '14'], '4': ['1', '6', '4'], '6': ['1', '6'], '7': ['1', '6', '7'], '11': ['1', '6', '66', '11'], '66': ['1', '6', '66'], '5': ['1', '6', '66', '5'], '9': ['1', '6', '66', '9'], '19': ['1', '6', '7', '19']}, '66': {'66': ['66'], '26': ['66', '5', '26'], '5': ['66', '5'], '14': ['66', '14'], '9': ['66', '9'], '11': ['66', '11'], '6': ['66', '14', '6'], '7': ['66', '14', '7'], '4': ['66', '5', '4'], '19': ['66', '11', '19'], '1': ['66', '14', '1']}, '4': {'4': ['4'], '26': ['4', '26'], '5': ['4', '26', '5'], '14': ['4', '26', '14'], '66': ['4', '6', '66'], '6': ['4', '6'], '7': ['4', '26', '7'], '9': ['4', '26', '9'], '1': ['4', '6', '1'], '11': ['4', '6', '66', '11'], '19': ['4', '26', '7', '19']}, '11': {'11': ['11'], '5': ['11', '66', '5'], '7': ['11', '9', '7'], '9': ['11', '9'], '4': ['11', '4'], '66': ['11', '66'], '19': ['11', '19'], '14': ['11', '9', '14'], '6': ['11', '9', '14', '6'], '26': ['11', '66', '5', '26'], '1': ['11', '9', '14', '1']}, '6': {'6': ['6'], 
'7': ['6', '7'], '14': ['6', '66', '14'], '4': ['6', '4'], '1': ['6', '1'], '11': ['6', '66', '11'], '66': ['6', '66'], '26': ['6', '4', '26'], '5': ['6', '66', '5'], '9': ['6', '66', '9'], '19': ['6', '7', '19']}, '9': {'9': ['9'], '7': ['9', '7'], '14': ['9', '14'], '11': ['9', '11'], '66': ['9', '11', '66'], '6': ['9', '14', '6'], '5': ['9', '14', '5'], '4': ['9', '14', '4'], '19': ['9', '7', '19'], '26': ['9', '14', '5', '26'], '1': ['9', '14', '1']}, '19': {'19': ['19'], '7': ['19', '7'], '14': ['19', '14'], '9': ['19', '9'], '11': ['19', '9', '11'], '66': ['19', '9', '11', '66'], '6': ['19', '14', '6'], '5': ['19', '14', '5'], '4': ['19', '14', '4'], '26': ['19', '14', '5', '26'], '1': ['19', '14', '1']}}
In [111]:
print(dis_Liv['1']['9'])
['1', '6', '66', '9']
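  • The effect of the 1/number_of_passes weighting can be seen on a toy network. This is a minimal sketch with hypothetical jersey numbers and pass counts: the direct link from '1' to '7' was used only once, while the route through '4' was used ten times on each leg, so the weighted shortest path prefers the detour:

```python
import networkx as nx

# (pass_maker, pass_receiver, number_of_passes), hypothetical values
passes = [("1", "4", 10), ("4", "7", 10), ("1", "7", 1)]

G_toy = nx.DiGraph()
for maker, receiver, n in passes:
    # Frequent connections get a small distance
    G_toy.add_edge(maker, receiver, weight=1 / n)

path = nx.shortest_path(G_toy, source="1", target="7", weight="weight")
cost = nx.shortest_path_length(G_toy, source="1", target="7", weight="weight")
print(path, cost)
```

  • Here the route via '4' costs 0.1 + 0.1, which is less than the cost of 1.0 for the direct pass.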
  • Now we will calculate another important metric called eccentricity, which is based on shortest distances. The eccentricity of a player node p tells us how far the furthest player node from p is in the pass network. Note that nx.eccentricity, as called below without a weight argument, measures distance as the number of passes (hops) and does not use the edge weights. Let us calculate the eccentricities of all 11 nodes for Real Madrid.
In [112]:
E_Real = nx.eccentricity(G_Real_mod)
print(E_Real)
{'14': 2, '2': 2, '7': 2, '22': 2, '9': 2, '10': 2, '12': 2, '5': 2, '4': 2, '8': 1, '1': 2}
  • We can calculate the average eccentricity:
In [113]:
av_E_Real = sum(list(E_Real.values()))/len(E_Real)
print(av_E_Real)
1.9090909090909092
  • For Liverpool:
In [114]:
E_Liv = nx.eccentricity(G_Liv_mod)
print(E_Liv)
{'5': 2, '26': 2, '7': 1, '14': 2, '1': 2, '66': 2, '4': 2, '11': 2, '6': 2, '9': 2, '19': 2}
  • We can calculate the average eccentricity:
In [115]:
av_E_Liv = sum(list(E_Liv.values()))/len(E_Liv)
print(av_E_Liv)
1.9090909090909092
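  • The eccentricities above are hop counts. If a weighted version is wanted, one option is to take, for every node, the largest weighted shortest-path length to any other node. The following is a minimal sketch on a toy graph with hypothetical 1/number_of_passes weights. It assumes, as in the match graphs, that every node can reach every other node:

```python
import networkx as nx

# Toy strongly connected network; weights play the role of 1/number_of_passes
G_toy = nx.DiGraph()
G_toy.add_weighted_edges_from([
    ("1", "2", 0.1), ("2", "3", 0.1), ("3", "1", 0.5), ("1", "3", 1.0),
])

# Hop-based eccentricity, as computed in the cells above
hop_ecc = nx.eccentricity(G_toy)

# Weighted eccentricity: farthest node measured by weighted shortest-path length
lengths = dict(nx.shortest_path_length(G_toy, weight="weight"))
weighted_ecc = {node: max(dist.values()) for node, dist in lengths.items()}

print(hop_ecc)
print(weighted_ecc)
```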
  • We can also calculate the average clustering coefficient of a graph. Let us first compute this metric for G_Real (note that this graph should not be the modified version)
In [116]:
cc_Real = nx.average_clustering(G_Real, weight = 'weight')
print(cc_Real)
0.182334851979709
  • For Liverpool:
In [117]:
cc_Liv = nx.average_clustering(G_Liv, weight = 'weight')
print(cc_Liv)
0.2766427842450553
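  • As a sanity check on what this coefficient measures, here is a minimal sketch on two toy graphs that are not taken from the match data: a clique, where every pair of neighbours is connected, and a star, where no two neighbours are connected:

```python
import networkx as nx

clique = nx.complete_graph(4)  # every node is linked to every other node
star = nx.star_graph(4)        # a hub linked to four leaves, no triangles

cc_clique = nx.average_clustering(clique)
cc_star = nx.average_clustering(star)
print(cc_clique, cc_star)
```

  • The two match networks fall between these extremes.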
  • The average clustering coefficient lies in the range [0, 1], where a value of 0 means that no two neighbours of any node are connected to each other (the network contains no triangles) and a value of 1 means that the network is a clique, that is, each node is connected to all the other nodes of the network. Interestingly, the average clustering coefficient is lower for Real Madrid's pass network, suggesting that Real Madrid's players formed fewer tightly connected passing triangles than Liverpool's.
  • Finally, we can compute the centrality (especially the betweenness centrality) for each node in either team's pass network and understand which player was the most important in their pass network. For Real Madrid:
In [118]:
bc_Real = nx.betweenness_centrality(G_Real, weight = 'weight')
print(bc_Real)
{'14': 0.15222222222222223, '2': 0.10685185185185186, '7': 0.05592592592592593, '22': 0.0, '9': 0.14462962962962964, '10': 0.12407407407407407, '12': 0.009259259259259259, '5': 0.007407407407407408, '4': 0.06851851851851852, '8': 0.031481481481481485, '1': 0.11703703703703704}
  • We can find the node that has the maximum betweenness centrality:
In [119]:
max_bc_Real = max(bc_Real, key = bc_Real.get)
print(max_bc_Real)
14
  • For Liverpool:
In [120]:
bc_Liv = nx.betweenness_centrality(G_Liv, weight = 'weight')
print(bc_Liv)
max_bc_Liv = max(bc_Liv, key = bc_Liv.get)
print(max_bc_Liv)
{'5': 0.06296296296296296, '26': 0.016666666666666666, '7': 0.2453703703703704, '14': 0.12407407407407407, '1': 0.002777777777777778, '66': 0.075, '4': 0.07222222222222222, '11': 0.05555555555555556, '6': 0.1259259259259259, '9': 0.021296296296296296, '19': 0.03888888888888889}
7
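  • One caveat worth checking when reading these numbers: networkx interprets the weight passed to betweenness_centrality as a distance, so a high pass count makes a connection look long. The following minimal sketch, on a toy network with hypothetical pass counts, compares raw counts with 1/number_of_passes weights. It illustrates the difference and does not replace the results above:

```python
import networkx as nx

# (pass_maker, pass_receiver, number_of_passes), hypothetical values
passes = [("1", "4", 10), ("4", "7", 10), ("1", "7", 1)]

G_counts = nx.DiGraph()   # weight = number of passes
G_inverse = nx.DiGraph()  # weight = 1 / number of passes
for maker, receiver, n in passes:
    G_counts.add_edge(maker, receiver, weight=n)
    G_inverse.add_edge(maker, receiver, weight=1 / n)

bc_counts = nx.betweenness_centrality(G_counts, weight="weight")
bc_inverse = nx.betweenness_centrality(G_inverse, weight="weight")
print(bc_counts)
print(bc_inverse)
```

  • With raw counts, the rarely used direct link is the "shortest" route and player '4' gets no credit. With inverse weights, the busy route through '4' is the shortest one. The same comparison could be made by calling nx.betweenness_centrality on G_Real_mod and G_Liv_mod.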
  • So we see that the betweenness centrality is highest for 'Carlos Henrique Casimiro' (jersey: 14) from Real Madrid and 'James Philip Milner' (jersey: 7) from Liverpool. We have been able to compute some interesting results using complex network analysis on our pass networks.

Visualizing Convex Hulls from players' event data

  • First we will study how to develop a convex hull around those points (locations denoted by x- and y- coordinates) from where a player had made a pass or had taken a shot in a particular game.
  • Mathematically, if these points are contained in a set X then the convex hull is the smallest convex set that contains X. This will help us get an idea about the optimal field coverage of a player during the match.
  • Let us see what a convex hull for a set of points looks like:

convexhull.png

  • This figure has been adapted from the Wikipedia article on convex hulls.
  • Before we start with our data collection and analysis, we need to install the scipy package, which provides a collection of modules for scientific computation with Python.
  • For this section, we need the scipy.spatial module, which allows us to work with spatial algorithms and data structures. As we are going to work with convex hulls first, let us import the ConvexHull class from scipy.spatial:
In [121]:
from scipy.spatial import ConvexHull
  • As we have been doing till now, let us pick the important columns from the events dataset:
In [135]:
events_hull = events[['team', 'location', 'type', 'player']]
events_hull.head(10)
Out[135]:
team location type player
0 Real Madrid NaN Starting XI NaN
1 Liverpool NaN Starting XI NaN
2 Real Madrid NaN Half Start NaN
3 Liverpool NaN Half Start NaN
4 Liverpool NaN Half Start NaN
5 Real Madrid NaN Half Start NaN
6 Liverpool [60.0, 40.0] Pass James Philip Milner
7 Liverpool [35.0, 40.8] Pass Dejan Lovren
8 Real Madrid [27.4, 60.2] Pass Raphaël Varane
9 Real Madrid [35.3, 75.4] Pass Luka Modrić
  • Seems like we only need four columns for now. As we are only focusing on pass and shot events, we will first filter the dataset by setting type to Pass or Shot.
In [136]:
events_hull = events_hull[(events_hull['type'] == 'Pass') | (events_hull['type'] == 'Shot')].reset_index()
events_hull.head(10)
Out[136]:
index team location type player
0 6 Liverpool [60.0, 40.0] Pass James Philip Milner
1 7 Liverpool [35.0, 40.8] Pass Dejan Lovren
2 8 Real Madrid [27.4, 60.2] Pass Raphaël Varane
3 9 Real Madrid [35.3, 75.4] Pass Luka Modrić
4 10 Real Madrid [22.3, 76.6] Pass Daniel Carvajal Ramos
5 11 Real Madrid [36.2, 75.3] Pass Carlos Henrique Casimiro
6 12 Liverpool [76.5, 18.1] Pass Jordan Brian Henderson
7 13 Liverpool [84.4, 10.0] Pass Sadio Mané
8 14 Liverpool [91.6, 21.3] Pass Roberto Firmino Barbosa de Oliveira
9 15 Liverpool [92.2, 50.9] Pass Mohamed Salah
  • Then, we will split the location column into location_x and location_y columns:
In [137]:
Loc = events_hull['location']
Loc = pd.DataFrame(Loc.to_list(), columns=['location_x', 'location_y'])
events_hull['location_x'] = Loc['location_x']
events_hull['location_y'] = Loc['location_y']
events_hull.head(10)
Out[137]:
index team location type player location_x location_y
0 6 Liverpool [60.0, 40.0] Pass James Philip Milner 60.0 40.0
1 7 Liverpool [35.0, 40.8] Pass Dejan Lovren 35.0 40.8
2 8 Real Madrid [27.4, 60.2] Pass Raphaël Varane 27.4 60.2
3 9 Real Madrid [35.3, 75.4] Pass Luka Modrić 35.3 75.4
4 10 Real Madrid [22.3, 76.6] Pass Daniel Carvajal Ramos 22.3 76.6
5 11 Real Madrid [36.2, 75.3] Pass Carlos Henrique Casimiro 36.2 75.3
6 12 Liverpool [76.5, 18.1] Pass Jordan Brian Henderson 76.5 18.1
7 13 Liverpool [84.4, 10.0] Pass Sadio Mané 84.4 10.0
8 14 Liverpool [91.6, 21.3] Pass Roberto Firmino Barbosa de Oliveira 91.6 21.3
9 15 Liverpool [92.2, 50.9] Pass Mohamed Salah 92.2 50.9
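  • The split above relies on events_hull having a fresh 0, 1, 2, ... index, which is the case here because of the earlier reset_index(). A minimal sketch of a variant that also works when the index has gaps is given below. The toy frame and its values are hypothetical:

```python
import pandas as pd

# Toy events with a non-consecutive index, as left behind by filtering rows
toy = pd.DataFrame(
    {
        "player": ["A", "B", "C"],
        "location": [[60.0, 40.0], [35.0, 40.8], [27.4, 60.2]],
    },
    index=[6, 7, 8],
)

# Reuse the original index so that the new columns align row by row
xy = pd.DataFrame(
    toy["location"].tolist(),
    columns=["location_x", "location_y"],
    index=toy.index,
)
toy = toy.join(xy).drop(columns="location")
print(toy)
```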
  • We can now discard the location column:
In [138]:
events_hull = events_hull[['team', 'type', 'player', 'location_x', 'location_y']]
events_hull.head(10)
Out[138]:
team type player location_x location_y
0 Liverpool Pass James Philip Milner 60.0 40.0
1 Liverpool Pass Dejan Lovren 35.0 40.8
2 Real Madrid Pass Raphaël Varane 27.4 60.2
3 Real Madrid Pass Luka Modrić 35.3 75.4
4 Real Madrid Pass Daniel Carvajal Ramos 22.3 76.6
5 Real Madrid Pass Carlos Henrique Casimiro 36.2 75.3
6 Liverpool Pass Jordan Brian Henderson 76.5 18.1
7 Liverpool Pass Sadio Mané 84.4 10.0
8 Liverpool Pass Roberto Firmino Barbosa de Oliveira 91.6 21.3
9 Liverpool Pass Mohamed Salah 92.2 50.9
  • We will next split the data into two datasets, one for Real Madrid and the other for Liverpool:
In [139]:
events_hull_Real = events_hull[events_hull['team'] == 'Real Madrid'].reset_index()
events_hull_Liv = events_hull[events_hull['team'] == 'Liverpool'].reset_index()
In [140]:
events_hull_Real.head(5)
Out[140]:
index team type player location_x location_y
0 2 Real Madrid Pass Raphaël Varane 27.4 60.2
1 3 Real Madrid Pass Luka Modrić 35.3 75.4
2 4 Real Madrid Pass Daniel Carvajal Ramos 22.3 76.6
3 5 Real Madrid Pass Carlos Henrique Casimiro 36.2 75.3
4 10 Real Madrid Pass Sergio Ramos García 14.7 23.2
In [141]:
events_hull_Liv.head(5)
Out[141]:
index team type player location_x location_y
0 0 Liverpool Pass James Philip Milner 60.0 40.0
1 1 Liverpool Pass Dejan Lovren 35.0 40.8
2 6 Liverpool Pass Jordan Brian Henderson 76.5 18.1
3 7 Liverpool Pass Sadio Mané 84.4 10.0
4 8 Liverpool Pass Roberto Firmino Barbosa de Oliveira 91.6 21.3
  • Next, we will list the names of the players from both teams:
In [142]:
players_Real = events_hull_Real.player.unique()
players_Liv = events_hull_Liv.player.unique()
print(players_Real)
print(players_Liv)
['Raphaël Varane' 'Luka Modrić' 'Daniel Carvajal Ramos'
 'Carlos Henrique Casimiro' 'Sergio Ramos García'
 'Marcelo Vieira da Silva Júnior' 'Toni Kroos'
 'Cristiano Ronaldo dos Santos Aveiro' 'Karim Benzema'
 'Keylor Navas Gamboa' 'Francisco Román Alarcón Suárez'
 'José Ignacio Fernández Iglesias' 'Gareth Frank Bale'
 'Marco Asensio Willemsen']
['James Philip Milner' 'Dejan Lovren' 'Jordan Brian Henderson'
 'Sadio Mané' 'Roberto Firmino Barbosa de Oliveira' 'Mohamed Salah'
 'Trent Alexander-Arnold' 'Virgil van Dijk' 'Andrew Robertson'
 'Georginio Wijnaldum' 'Loris Karius' 'Adam David Lallana' 'Emre Can']
  • We will now extract the event data for 'Toni Kroos' from events_hull_Real.
In [143]:
events_hull_Toni = events_hull_Real[events_hull_Real['player'] == 'Toni Kroos']
events_hull_Toni
Out[143]:
index team type player location_x location_y
7 13 Real Madrid Pass Toni Kroos 48.8 13.9
15 22 Real Madrid Pass Toni Kroos 23.4 18.6
30 73 Real Madrid Pass Toni Kroos 35.0 24.9
36 79 Real Madrid Pass Toni Kroos 41.7 21.7
40 83 Real Madrid Pass Toni Kroos 50.6 28.3
... ... ... ... ... ... ...
638 969 Real Madrid Pass Toni Kroos 120.0 80.0
639 970 Real Madrid Pass Toni Kroos 120.0 80.0
641 972 Real Madrid Pass Toni Kroos 96.8 73.1
666 1020 Real Madrid Pass Toni Kroos 120.0 0.1
672 1032 Real Madrid Pass Toni Kroos 56.9 41.5

92 rows × 6 columns

  • Before computing and visualizing the convex hull, it is good practice to discard outliers from the dataset. A common method is the interquartile range (IQR). We will find the interquartile ranges for the columns location_x and location_y of events_hull_Toni and then compute the lower and upper bounds of the data, Q1 - 1.5*IQR and Q3 + 1.5*IQR. Any point lying beyond these bounds, i.e. below the lower bound or above the upper bound, is treated as an outlier and discarded. We use box-and-whisker plots to visualize the interquartile range of the data points:
In [144]:
e_box = pd.DataFrame(data = events_hull_Toni, columns = ["location_x", "location_y"])
boxplot = sns.boxplot(x = "variable", y ="value", data=pd.melt(e_box), 
                      order = ["location_x", "location_y"])
boxplot = sns.stripplot(x = "variable", y = "value", data = pd.melt(e_box), marker="o",
                        color="red", order = ["location_x", "location_y"])
boxplot.axes.set_title("Boxplot for Toni Kroos's location coordinates")
plt.show()
  • We will next compute the quartiles, the inter quartile range and the minimum and maximum values:
In [145]:
Q1 = np.percentile(events_hull_Toni['location_x'], 25, interpolation='midpoint')
Q3 = np.percentile(events_hull_Toni['location_x'], 75, interpolation='midpoint')
IQR_x = Q3 - Q1

minimum_x = Q1 - 1.5*IQR_x
maximum_x = Q3 + 1.5*IQR_x
Q1, Q3, IQR_x, minimum_x, maximum_x
Out[145]:
(47.400000000000006,
 67.85,
 20.44999999999999,
 16.725000000000023,
 98.52499999999998)
In [146]:
Q1 = np.percentile(events_hull_Toni['location_y'], 25, interpolation='midpoint')
Q3 = np.percentile(events_hull_Toni['location_y'], 75, interpolation='midpoint')
IQR_y = Q3 - Q1

minimum_y = Q1 - 1.5*IQR_y
maximum_y = Q3 + 1.5*IQR_y
Q1, Q3, IQR_y, minimum_y, maximum_y
Out[146]:
(15.0, 41.8, 26.799999999999997, -25.199999999999996, 82.0)
In [147]:
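# Note: with `&`, a point is flagged only when BOTH coordinates are out of bounds on the same side;
# replace `&` with `|` to flag a point when EITHER coordinate is out of bounds.
upper = np.where((events_hull_Toni['location_x'] >= maximum_x) & (events_hull_Toni['location_y'] >= maximum_y))
lower = np.where((events_hull_Toni['location_x'] <= minimum_x) & (events_hull_Toni['location_y'] <= minimum_y))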
  • Finally, we will drop the outliers if present:
In [148]:
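# np.where returns positions, while drop() expects index labels, so translate first
outlier_labels = events_hull_Toni.index[np.concatenate([upper[0], lower[0]])]
events_hull_Toni = events_hull_Toni.drop(outlier_labels)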
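  • For comparison, here is a minimal sketch of a per-coordinate IQR filter written with boolean masks, so that no translation between positions and index labels is needed. The helper iqr_bounds and the toy data are assumptions made for illustration. The sketch uses pandas' default linear interpolation for the quartiles rather than the 'midpoint' rule used above, so the bounds can differ slightly:

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds Q1 - k*IQR and Q3 + k*IQR for a Series."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy locations with a non-consecutive index; the last row is far from the rest
toy = pd.DataFrame(
    {"location_x": [10, 12, 11, 13, 12, 100], "location_y": [5, 6, 5, 7, 6, 6]},
    index=[7, 15, 30, 36, 40, 638],
)

x_low, x_high = iqr_bounds(toy["location_x"])
y_low, y_high = iqr_bounds(toy["location_y"])

# A row is an outlier if EITHER coordinate falls outside its bounds
is_outlier = (~toy["location_x"].between(x_low, x_high)) | (
    ~toy["location_y"].between(y_low, y_high)
)
filtered = toy[~is_outlier]
print(filtered)
```

  • Applying such a filter to events_hull_Toni would change which points remain and therefore the hull computed afterwards.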
  • Let us look into the events_hull_Toni dataset:
In [149]:
events_hull_Toni = events_hull_Toni.reset_index()
events_hull_Toni = events_hull_Toni[['team', 'type', 'player', 'location_x', 'location_y']]
events_hull_Toni.head(10)
Out[149]:
team type player location_x location_y
0 Real Madrid Pass Toni Kroos 48.8 13.9
1 Real Madrid Pass Toni Kroos 23.4 18.6
2 Real Madrid Pass Toni Kroos 35.0 24.9
3 Real Madrid Pass Toni Kroos 41.7 21.7
4 Real Madrid Pass Toni Kroos 50.6 28.3
5 Real Madrid Pass Toni Kroos 42.2 11.1
6 Real Madrid Pass Toni Kroos 48.7 53.1
7 Real Madrid Pass Toni Kroos 56.7 59.6
8 Real Madrid Pass Toni Kroos 56.4 15.2
9 Real Madrid Pass Toni Kroos 42.9 9.4
  • First we collect all the points from the two columns as a 2-D array. This will help when drawing the convex hull.
In [150]:
points_hull = events_hull_Toni[['location_x', 'location_y']].values
  • Now, let us use the ConvexHull class from scipy.spatial:
In [151]:
convex_hull_Toni = ConvexHull(events_hull_Toni[['location_x', 'location_y']])
  • This convex hull is represented by its vertices, i.e. the points that form the corners of the hull, and by its simplices, i.e. the straight line segments (in a 2-D plane) that connect neighbouring vertices. The vertices attribute holds the indices of the points in points_hull that make up the convex hull. The simplices attribute holds pairs of such indices, each pair describing one line segment of the hull boundary. Let us print the indices:
In [152]:
print(convex_hull_Toni.vertices)
[50 41 55 75 84  1 67 51]
In [153]:
print(convex_hull_Toni.simplices)
[[50 41]
 [67  1]
 [84  1]
 [84 75]
 [55 41]
 [55 75]
 [51 50]
 [51 67]]
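  • To make the vertices and simplices attributes concrete, here is a minimal sketch on five hand-picked points that are not from the match data: the four corners of a square plus one interior point. Only the corners should appear on the hull:

```python
import numpy as np
from scipy.spatial import ConvexHull

# Four corners of a 2 x 2 square and one point in the middle
points = np.array([[0, 0], [2, 0], [2, 2], [0, 2], [1, 1]], dtype=float)
hull = ConvexHull(points)

hull_vertices = sorted(hull.vertices.tolist())  # indices of the corner points
hull_edges = hull.simplices                     # one pair of indices per boundary segment
hull_area = hull.volume      # for 2-D input, `volume` is the enclosed area
hull_perimeter = hull.area   # for 2-D input, `area` is the perimeter

print(hull_vertices)
print(hull_edges)
print(hull_area, hull_perimeter)
```

  • The same volume attribute on convex_hull_Toni gives the area of the hull in pitch units, which is one way to compare the coverage of different players.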
  • We have collected all the useful information and will visualize the convex hull on a football pitch:
In [154]:
pitch = Pitch(pitch_color='grass', stripe = True, line_color='black', goal_type='box', 
              constrained_layout=True, tight_layout=False)
fig, ax = pitch.draw()

plt.scatter(events_hull_Toni.location_x, events_hull_Toni.location_y, color='white')

for i in convex_hull_Toni.simplices:
    plt.plot(points_hull[i, 0], points_hull[i, 1], 'black')
# Fill the hull once, outside the loop, so the shading is not stacked once per edge
plt.fill(points_hull[convex_hull_Toni.vertices, 0], points_hull[convex_hull_Toni.vertices, 1], 
         c='grey', alpha=0.1)

plt.title("Convex Hull for Toni Kroos's field coverage against Liverpool")
Out[154]:
Text(0.5, 1.0, "Convex Hull for Toni Kroos's field coverage against Liverpool")
  • We can draw the convex hulls for other players from either team in the same way:

1.png

Tracking data, Delaunay Triangulations and Voronoi Diagrams

  • So, we have been able to compute and visualize the convex hulls for players from a particular game. Next, we will try to understand how to get tracking data from a particular game using statsbomb api. We need tracking data to compute Delaunay triangulations and Voronoi diagrams.

  • The match id that we have been working with is 18245.

  • We need to first import useful classes from the mplsoccer.statsbomb module:

In [155]:
from mplsoccer.statsbomb import read_event, EVENT_SLUG
  • Next, we will use the code from here to extract the tracking data for the match:
In [164]:
event_json = read_event(f'{EVENT_SLUG}/18245.json', related_event_df = False, 
                        tactics_lineup_df = False, warn = False)
event = event_json['event']
tracking = event_json['shot_freeze_frame']
  • Let us look at the event and tracking datasets:
In [165]:
event.head(5)
Out[165]:
match_id id index period timestamp_minute timestamp_second timestamp_millisecond minute second type_id ... injury_stoppage_in_chain shot_statsbomb_xg shot_key_pass_id shot_first_time shot_one_on_one shot_redirect substitution_replacement_id substitution_replacement_name tactics_formation aerial_won
0 18245 5eee3ffd-f0c0-4532-868b-4a66cbf20cb8 1 1 0 0 0 0 0 35 ... NaN NaN NaN NaN NaN NaN NaN NaN 41212.0 NaN
1 18245 eaa65a92-02d3-4375-b2b7-7c2f679a620c 2 1 0 0 0 0 0 35 ... NaN NaN NaN NaN NaN NaN NaN NaN 433.0 NaN
2 18245 9c82d2e5-ebba-4825-b7f9-b11b04433ed8 3 1 0 0 0 0 0 18 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 18245 b791047a-3eea-452f-b3a9-212bd40cd7cb 4 1 0 0 0 0 0 18 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 18245 25be91a5-a084-42cb-8cc1-a0fe7b0f52f9 5 1 0 0 371 0 0 30 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 77 columns

In [166]:
event.tail(5)
Out[166]:
match_id id index period timestamp_minute timestamp_second timestamp_millisecond minute second type_id ... injury_stoppage_in_chain shot_statsbomb_xg shot_key_pass_id shot_first_time shot_one_on_one shot_redirect substitution_replacement_id substitution_replacement_name tactics_formation aerial_won
3492 18245 b4258521-d4ec-466d-a90c-e4522692a45b 3493 2 47 30 959 92 30 30 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3493 18245 37f51448-ebd1-4d67-8d9e-fa4b450111b2 3494 2 47 33 52 92 33 42 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3494 18245 e9f7bb50-f4fc-45aa-87d3-20bbe9ebd32f 3495 2 47 39 157 92 39 40 ... True NaN NaN NaN NaN NaN NaN NaN NaN NaN
3495 18245 ce7d446a-e8bf-4631-bcf5-2bd323ba251e 3496 2 48 2 893 93 2 34 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3496 18245 d19b2348-de55-4bbf-9b1f-e44d95aa3a77 3497 2 48 2 893 93 2 34 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 77 columns

In [167]:
tracking.head(5)
Out[167]:
id event_freeze_id player_teammate player_id player_name player_position_id player_position_name x y match_id
0 682270cc-4bc4-4952-8f91-d3c5a704a691 1 False 5463 Luka Modrić 13 Right Center Midfield 98.0 48.4 18245
1 9f5aa3eb-3bed-4bc0-97a5-bb8444b235b9 1 True 3535 Roberto Firmino Barbosa de Oliveira 23 Center Forward 109.0 39.9 18245
2 399ac143-5f7b-4080-8c0b-3c18435d7fc1 1 True 3655 Andrew Robertson 6 Left Back 102.1 2.5 18245
3 660d9d98-46b6-4b5e-9c9a-435d63142c93 1 True 4926 Francisco Román Alarcón Suárez 19 Center Attacking Midfield 100.2 11.0 18245
4 fe6c7f60-2ff0-4077-882e-b045c8abc7c3 1 True 3629 Sadio Mané 21 Left Wing 90.9 32.3 18245
In [168]:
tracking.tail(5)
Out[168]:
id event_freeze_id player_teammate player_id player_name player_position_id player_position_name x y match_id
356 18f64bd1-c8a9-4f31-9e58-3ec7a1de0a80 16 False 5463 Luka Modrić 13 Right Center Midfield 99.9 19.0 18245
357 9f5aa3eb-3bed-4bc0-97a5-bb8444b235b9 17 False 5463 Luka Modrić 13 Right Center Midfield 99.2 50.3 18245
358 18f64bd1-c8a9-4f31-9e58-3ec7a1de0a80 17 False 5201 Sergio Ramos García 5 Left Center Back 114.1 42.9 18245
359 9f5aa3eb-3bed-4bc0-97a5-bb8444b235b9 18 False 5574 Toni Kroos 15 Left Center Midfield 102.7 37.0 18245
360 18f64bd1-c8a9-4f31-9e58-3ec7a1de0a80 18 False 5485 Raphaël Varane 3 Right Center Back 114.4 37.3 18245
  • Looking at the two datasets event and tracking, we understand that the former represents the event data and the latter represents the tracking data. Let us look into the columns of the tracking dataset:
In [169]:
print(tracking.columns)
Index(['id', 'event_freeze_id', 'player_teammate', 'player_id', 'player_name',
       'player_position_id', 'player_position_name', 'x', 'y', 'match_id'],
      dtype='object')
  • If we look closely at the tracking dataset, we see that the column id holds a unique id for a shot freeze frame, i.e. it identifies the moment when a particular player was taking a shot, and the rows sharing that id give the locations of the other players at that moment. Using the player_name column, we will add a column team to the tracking dataset, telling us which team each player in a freeze frame belongs to.
In [170]:
# Vectorized assignment; avoids chained indexing such as tracking['team'][i] = ...
tracking['team'] = np.where(tracking['player_name'].isin(players_Real), 'Real Madrid', 'Liverpool')
In [171]:
tracking.head(5)
Out[171]:
id event_freeze_id player_teammate player_id player_name player_position_id player_position_name x y match_id team
0 682270cc-4bc4-4952-8f91-d3c5a704a691 1 False 5463 Luka Modrić 13 Right Center Midfield 98.0 48.4 18245 Real Madrid
1 9f5aa3eb-3bed-4bc0-97a5-bb8444b235b9 1 True 3535 Roberto Firmino Barbosa de Oliveira 23 Center Forward 109.0 39.9 18245 Liverpool
2 399ac143-5f7b-4080-8c0b-3c18435d7fc1 1 True 3655 Andrew Robertson 6 Left Back 102.1 2.5 18245 Liverpool
3 660d9d98-46b6-4b5e-9c9a-435d63142c93 1 True 4926 Francisco Román Alarcón Suárez 19 Center Attacking Midfield 100.2 11.0 18245 Real Madrid
4 fe6c7f60-2ff0-4077-882e-b045c8abc7c3 1 True 3629 Sadio Mané 21 Left Wing 90.9 32.3 18245 Liverpool
  • Now, we will only extract the relevant columns:
In [172]:
tracking = tracking[['id', 'player_name', 'x', 'y', 'team']]
tracking.head(5)
Out[172]:
id player_name x y team
0 682270cc-4bc4-4952-8f91-d3c5a704a691 Luka Modrić 98.0 48.4 Real Madrid
1 9f5aa3eb-3bed-4bc0-97a5-bb8444b235b9 Roberto Firmino Barbosa de Oliveira 109.0 39.9 Liverpool
2 399ac143-5f7b-4080-8c0b-3c18435d7fc1 Andrew Robertson 102.1 2.5 Liverpool
3 660d9d98-46b6-4b5e-9c9a-435d63142c93 Francisco Román Alarcón Suárez 100.2 11.0 Real Madrid
4 fe6c7f60-2ff0-4077-882e-b045c8abc7c3 Sadio Mané 90.9 32.3 Liverpool
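  • Because each row of tracking describes one player in one freeze frame, grouping by id is a quick way to see how many players were recorded around each shot. The sketch below is only an illustration on a tiny hand-made frame: the ids shot-a and shot-b, the player labels and the row layout are made up, not taken from the real dataset.

```python
import pandas as pd

# Toy stand-in for the tracking dataset; ids and player labels are invented.
toy_tracking = pd.DataFrame({
    'id': ['shot-a', 'shot-a', 'shot-a', 'shot-b', 'shot-b'],
    'player_name': ['Player 1', 'Player 2', 'Player 3', 'Player 1', 'Player 4'],
    'x': [98.0, 109.0, 102.1, 100.2, 90.9],
    'y': [48.4, 39.9, 2.5, 11.0, 32.3],
    'team': ['Real Madrid', 'Liverpool', 'Liverpool', 'Real Madrid', 'Liverpool'],
})

# Number of players recorded in each freeze frame, split by team.
players_per_frame = (toy_tracking
                     .groupby(['id', 'team'])
                     .size()
                     .unstack(fill_value=0))
print(players_per_frame)
```

  • On the real data the same groupby call can be applied directly to tracking.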
  • Let us now collect the jersey numbers of the players from both teams. We will use a different, easier approach than the one we used before. To get the player information, use the following command, passing the match id:
In [175]:
player_info = sb.lineups(match_id = 18245)
credentials were not supplied. open data access only
  • player_info has information about both teams. Let us fetch the information for Real Madrid first:
In [177]:
info_Real = player_info['Real Madrid']
info_Real
Out[177]:
player_id player_name player_nickname jersey_number country
0 4926 Francisco Román Alarcón Suárez Isco 22 Spain
1 5200 Lucas Vázquez Iglesias Lucas Vázquez 17 Spain
2 5201 Sergio Ramos García Sergio Ramos 4 Spain
3 5202 José Ignacio Fernández Iglesias Nacho 6 Spain
4 5207 Cristiano Ronaldo dos Santos Aveiro Cristiano Ronaldo 7 Portugal
5 5456 Mateo Kovačić None 23 Croatia
6 5463 Luka Modrić None 10 Croatia
7 5485 Raphaël Varane None 5 France
8 5539 Carlos Henrique Casimiro Casemiro 14 Brazil
9 5552 Marcelo Vieira da Silva Júnior Marcelo 12 Brazil
10 5574 Toni Kroos None 8 Germany
11 5597 Keylor Navas Gamboa Keylor Navas 1 Costa Rica
12 5719 Marco Asensio Willemsen Marco Asensio 20 Spain
13 5721 Daniel Carvajal Ramos Daniel Carvajal 2 Spain
14 6399 Gareth Frank Bale Gareth Bale 11 Wales
15 6704 Theo Bernard François Hernández Theo Hernández 15 France
16 6706 Francisco Casilla Cortés Kiko Casilla 13 Spain
17 19677 Karim Benzema None 9 France
  • Let us only consider the player_name and jersey_number columns and build a dictionary:
In [178]:
info_Real = info_Real[['player_name', 'jersey_number']]
jerseys_Real = {}

for i in range(len(info_Real)):
    jerseys_Real[info_Real.player_name[i]] = str(info_Real.jersey_number[i])
print(jerseys_Real)
{'Francisco Román Alarcón Suárez': '22', 'Lucas Vázquez Iglesias': '17', 'Sergio Ramos García': '4', 'José Ignacio Fernández Iglesias': '6', 'Cristiano Ronaldo dos Santos Aveiro': '7', 'Mateo Kovačić': '23', 'Luka Modrić': '10', 'Raphaël Varane': '5', 'Carlos Henrique Casimiro': '14', 'Marcelo Vieira da Silva Júnior': '12', 'Toni Kroos': '8', 'Keylor Navas Gamboa': '1', 'Marco Asensio Willemsen': '20', 'Daniel Carvajal Ramos': '2', 'Gareth Frank Bale': '11', 'Theo Bernard François Hernández': '15', 'Francisco Casilla Cortés': '13', 'Karim Benzema': '9'}
  • Same thing for Liverpool:
In [180]:
info_Liv = player_info['Liverpool']
info_Liv = info_Liv[['player_name', 'jersey_number']]
jerseys_Liv = {}

for i in range(len(info_Liv)):
    jerseys_Liv[info_Liv.player_name[i]] = str(info_Liv.jersey_number[i])
print(jerseys_Liv)
{'Dejan Lovren': '6', 'James Philip Milner': '7', 'Emre Can': '23', 'Alberto Moreno Pérez': '18', 'Mohamed Salah': '11', 'Jordan Brian Henderson': '14', 'Roberto Firmino Barbosa de Oliveira': '9', 'Simon Mignolet': '22', 'Georginio Wijnaldum': '5', 'Dominic Solanke': '29', 'Sadio Mané': '19', 'Loris Karius': '1', 'Andrew Robertson': '26', 'Trent Alexander-Arnold': '66', 'Virgil van Dijk': '4', 'Adam David Lallana': '20', 'Ragnar Klavan': '17', 'Nathaniel Edwin Clyne': '2'}
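  • Both dictionaries are built in exactly the same way, so the step could be wrapped in a small helper. The sketch below is one possible version; the function name jersey_map is hypothetical, and the toy lineup contains only three rows copied from the Real Madrid table above.

```python
import pandas as pd

def jersey_map(lineup):
    """Return a {player_name: jersey number as a string} dictionary."""
    return dict(zip(lineup['player_name'], lineup['jersey_number'].astype(str)))

# Three rows copied from the Real Madrid lineup shown above.
toy_lineup = pd.DataFrame({
    'player_name': ['Luka Modrić', 'Toni Kroos', 'Karim Benzema'],
    'jersey_number': [10, 8, 9],
})
print(jersey_map(toy_lineup))
```

  • With the real data, the helper would be called as jersey_map(player_info['Real Madrid']) and jersey_map(player_info['Liverpool']).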
  • Now let us select a particular id from the tracking dataset, representing the instant when a particular shot was taken. We will filter tracking by an id value, which gives us the locations of the players on the pitch at that moment. We can view the unique id values:
In [183]:
tracking.id.unique()
Out[183]:
array(['682270cc-4bc4-4952-8f91-d3c5a704a691',
       '9f5aa3eb-3bed-4bc0-97a5-bb8444b235b9',
       '399ac143-5f7b-4080-8c0b-3c18435d7fc1',
       '660d9d98-46b6-4b5e-9c9a-435d63142c93',
       'fe6c7f60-2ff0-4077-882e-b045c8abc7c3',
       'eda7e108-2479-46f2-9cd0-a0bc2939e352',
       'c36dfe04-2f8e-48f0-8df6-1c4d0b93a16e',
       '3e93f456-9971-4a33-9b10-ee9961410a32',
       '9def9ed2-52f0-496b-8ae8-f4c5a97c2d8a',
       '20b934f1-9afa-401d-9a16-f97fea2b80d9',
       '6711367a-6855-4914-903e-a5e19771429c',
       'e8c20962-0eef-4066-97ce-dcaad4f70b52',
       '02f0755f-76cf-4d30-8062-369dc9509bdd',
       '6cb4171b-90e6-4473-831e-df7a2da29f28',
       '93c40040-ab9a-4549-8f0e-46c5c1c8e9cd',
       '142e18c8-316a-4f9f-a0f8-3c41549ad1c3',
       '6f994944-70fc-4a30-acca-315e3fede0bb',
       '7654fe57-734f-45d8-bc83-ab940cd37c45',
       '30a872eb-fe88-4c46-858b-a4f487cb69e4',
       '53b73ee0-8c9c-4b64-83c5-69fc453376a1',
       '804f8c8e-d714-4e6a-9cd1-599665efb8c8',
       '36687201-f131-4418-9dd0-f632bc9c4257',
       '650a2dc2-e5bb-4fac-9259-afbc03bdc322',
       '312f9c86-6a3c-42b1-bdeb-f92cb1b16a48',
       '222c90b6-8293-409a-ac6d-e2c3c2e69948',
       'c7f3935c-23fa-4ddc-a6ee-eb9d0972d034',
       '05688a6e-37f8-4aa6-a36e-d8151aa75997',
       '18f64bd1-c8a9-4f31-9e58-3ec7a1de0a80'], dtype=object)
  • Let us filter the dataset now:
In [245]:
shot_id = '3e93f456-9971-4a33-9b10-ee9961410a32' # select a particular value from the id column
tracking_filtered = tracking[tracking['id'] == shot_id] # filter by the shot_id
event_filtered = event[event['id'] == shot_id]

event_filtered = event_filtered[['id', 'player_name', 'x', 'y', 'team_name']]
event_filtered = event_filtered.rename(columns = {'team_name':'team'})

data_filtered = pd.concat([event_filtered, tracking_filtered])
  • The data_filtered dataset looks like this:
In [246]:
data_filtered
Out[246]:
id player_name x y team
747 3e93f456-9971-4a33-9b10-ee9961410a32 Cristiano Ronaldo dos Santos Aveiro 111.7 58.7 Real Madrid
7 3e93f456-9971-4a33-9b10-ee9961410a32 Loris Karius 118.1 45.0 Liverpool
35 3e93f456-9971-4a33-9b10-ee9961410a32 Roberto Firmino Barbosa de Oliveira 100.8 49.0 Liverpool
63 3e93f456-9971-4a33-9b10-ee9961410a32 Daniel Carvajal Ramos 100.9 50.2 Real Madrid
91 3e93f456-9971-4a33-9b10-ee9961410a32 James Philip Milner 91.3 28.4 Liverpool
119 3e93f456-9971-4a33-9b10-ee9961410a32 Karim Benzema 108.9 37.9 Real Madrid
147 3e93f456-9971-4a33-9b10-ee9961410a32 Georginio Wijnaldum 105.7 56.5 Liverpool
175 3e93f456-9971-4a33-9b10-ee9961410a32 Jordan Brian Henderson 108.0 50.0 Liverpool
202 3e93f456-9971-4a33-9b10-ee9961410a32 Virgil van Dijk 111.7 54.7 Liverpool
228 3e93f456-9971-4a33-9b10-ee9961410a32 Trent Alexander-Arnold 105.2 35.3 Liverpool
254 3e93f456-9971-4a33-9b10-ee9961410a32 Dejan Lovren 111.8 41.1 Liverpool
280 3e93f456-9971-4a33-9b10-ee9961410a32 Toni Kroos 91.0 30.3 Real Madrid
304 3e93f456-9971-4a33-9b10-ee9961410a32 Francisco Román Alarcón Suárez 102.4 40.6 Real Madrid
  • We will compute the Delaunay triangulations from a team's players' locations to get an idea about the possible links created among the teammates by the placement of the players.
  • This Wikipedia article states that for a set X of points in the 2-D Euclidean plane, a Delaunay triangulation is a triangulation such that no point in X lies inside the circumcircle of any triangle in the triangulation. A representation of a Delaunay triangulation from the same Wikipedia article: delaunay.png
  • We also need to import Delaunay from scipy.spatial to compute the triangulation:
In [247]:
from scipy.spatial import Delaunay
  • Next, let us split data_filtered by team:
In [248]:
tracking_Real = data_filtered[data_filtered['team'] == 'Real Madrid'].reset_index()
tracking_Liv = data_filtered[data_filtered['team'] == 'Liverpool'].reset_index()
In [249]:
tracking_Real
Out[249]:
index id player_name x y team
0 747 3e93f456-9971-4a33-9b10-ee9961410a32 Cristiano Ronaldo dos Santos Aveiro 111.7 58.7 Real Madrid
1 63 3e93f456-9971-4a33-9b10-ee9961410a32 Daniel Carvajal Ramos 100.9 50.2 Real Madrid
2 119 3e93f456-9971-4a33-9b10-ee9961410a32 Karim Benzema 108.9 37.9 Real Madrid
3 280 3e93f456-9971-4a33-9b10-ee9961410a32 Toni Kroos 91.0 30.3 Real Madrid
4 304 3e93f456-9971-4a33-9b10-ee9961410a32 Francisco Román Alarcón Suárez 102.4 40.6 Real Madrid
In [250]:
tracking_Liv
Out[250]:
index id player_name x y team
0 7 3e93f456-9971-4a33-9b10-ee9961410a32 Loris Karius 118.1 45.0 Liverpool
1 35 3e93f456-9971-4a33-9b10-ee9961410a32 Roberto Firmino Barbosa de Oliveira 100.8 49.0 Liverpool
2 91 3e93f456-9971-4a33-9b10-ee9961410a32 James Philip Milner 91.3 28.4 Liverpool
3 147 3e93f456-9971-4a33-9b10-ee9961410a32 Georginio Wijnaldum 105.7 56.5 Liverpool
4 175 3e93f456-9971-4a33-9b10-ee9961410a32 Jordan Brian Henderson 108.0 50.0 Liverpool
5 202 3e93f456-9971-4a33-9b10-ee9961410a32 Virgil van Dijk 111.7 54.7 Liverpool
6 228 3e93f456-9971-4a33-9b10-ee9961410a32 Trent Alexander-Arnold 105.2 35.3 Liverpool
7 254 3e93f456-9971-4a33-9b10-ee9961410a32 Dejan Lovren 111.8 41.1 Liverpool
  • Now, we are going to build the Delaunay triangulation for Real Madrid's attack at this particular instant. As we did for the convex hulls, we first convert the locations of the players into a 2-D matrix:
In [251]:
points_Real = tracking_Real[['x', 'y']].values
print(points_Real)
[[111.7  58.7]
 [100.9  50.2]
 [108.9  37.9]
 [ 91.   30.3]
 [102.4  40.6]]
  • Then, we compute the triangulations:
In [252]:
del_Real = Delaunay(tracking_Real[['x', 'y']])
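  • The defining property quoted above (no point lies inside the circumcircle of any triangle) can be checked numerically. The following is a minimal sketch on the five Real Madrid locations printed above; the helper names circumcircle and is_delaunay and the tolerance tol are assumptions made for this illustration, not part of scipy.

```python
import numpy as np
from scipy.spatial import Delaunay

# Real Madrid locations printed above.
points = np.array([[111.7, 58.7], [100.9, 50.2], [108.9, 37.9],
                   [91.0, 30.3], [102.4, 40.6]])
tri = Delaunay(points)

def circumcircle(a, b, c):
    """Centre and radius of the circle through the 2-D points a, b, c."""
    d = 2 * (a[0] * (b[1] - c[1]) + b[0] * (c[1] - a[1]) + c[0] * (a[1] - b[1]))
    ux = ((a @ a) * (b[1] - c[1]) + (b @ b) * (c[1] - a[1])
          + (c @ c) * (a[1] - b[1])) / d
    uy = ((a @ a) * (c[0] - b[0]) + (b @ b) * (a[0] - c[0])
          + (c @ c) * (b[0] - a[0])) / d
    centre = np.array([ux, uy])
    return centre, np.linalg.norm(centre - a)

def is_delaunay(points, simplices, tol=1e-6):
    """True if no point lies strictly inside any triangle's circumcircle."""
    for simplex in simplices:
        centre, radius = circumcircle(*points[simplex])
        distances = np.linalg.norm(points - centre, axis=1)
        if np.any(distances < radius - tol):
            return False
    return True

print(len(tri.simplices), is_delaunay(points, tri.simplices))
```

  • The attribute simplices holds one row of three point indices per triangle, which is also what plt.triplot uses when drawing the triangulation.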
  • We will create two more datasets to help us annotate the jersey numbers of the players on their respective nodes while visualizing the players on the pitch:
In [253]:
loc_Real = tracking_Real[['player_name','x', 'y']].reset_index()
loc_Liv = tracking_Liv[['player_name','x', 'y']].reset_index()
In [254]:
loc_Real
Out[254]:
index player_name x y
0 0 Cristiano Ronaldo dos Santos Aveiro 111.7 58.7
1 1 Daniel Carvajal Ramos 100.9 50.2
2 2 Karim Benzema 108.9 37.9
3 3 Toni Kroos 91.0 30.3
4 4 Francisco Román Alarcón Suárez 102.4 40.6
In [255]:
loc_Liv
Out[255]:
index player_name x y
0 0 Loris Karius 118.1 45.0
1 1 Roberto Firmino Barbosa de Oliveira 100.8 49.0
2 2 James Philip Milner 91.3 28.4
3 3 Georginio Wijnaldum 105.7 56.5
4 4 Jordan Brian Henderson 108.0 50.0
5 5 Virgil van Dijk 111.7 54.7
6 6 Trent Alexander-Arnold 105.2 35.3
7 7 Dejan Lovren 111.8 41.1
  • Finally, we visualize the triangulations and the players' positions at that instance on the pitch:
In [ ]:
pitch = Pitch(pitch_color='grass', stripe=True, line_color='white', view = 'half', figsize=(8, 9),
              constrained_layout=True, tight_layout=False, goal_type='box')
fig, ax = pitch.draw()

plt.scatter(tracking_Real.x, tracking_Real.y, color='white', s = 400, edgecolors='black', zorder=2)
plt.scatter(tracking_Liv.x, tracking_Liv.y, color='red', edgecolors='black', s = 400)

plt.triplot(points_Real[:, 0], points_Real[:, 1], del_Real.simplices.copy(), 'k-', lw = 4)

for index, row in loc_Real.iterrows():
    pitch.annotate(jerseys_Real[loc_Real['player_name'][row.name]], xy=(row.x, row.y), c ='black',
                   va = 'center', ha = 'center', size = 14, ax = ax)

for index, row in loc_Liv.iterrows():
    pitch.annotate(jerseys_Liv[loc_Liv['player_name'][row.name]], xy=(row.x, row.y), c ='black',
                   va = 'center', ha = 'center', size = 14, ax = ax)

Delaunay%20Real.png
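  • The links of the triangulation can also be listed explicitly, which is handy for measuring how close each opponent stands to a potential passing line. The sketch below is one way to do this, under stated assumptions: the locations are copied from the tracking_Real and tracking_Liv tables above, the shortened player labels are mine, and the helper names triangulation_edges and distance_to_segment are hypothetical. Distances are in StatsBomb pitch units.

```python
import numpy as np
from scipy.spatial import Delaunay

# Locations copied from the tracking_Real and tracking_Liv tables above.
real_xy = np.array([[111.7, 58.7],   # 0 Ronaldo
                    [100.9, 50.2],   # 1 Carvajal
                    [108.9, 37.9],   # 2 Benzema
                    [91.0, 30.3],    # 3 Kroos
                    [102.4, 40.6]])  # 4 Isco
liv_xy = {
    'Karius': (118.1, 45.0), 'Firmino': (100.8, 49.0),
    'Milner': (91.3, 28.4), 'Wijnaldum': (105.7, 56.5),
    'Henderson': (108.0, 50.0), 'van Dijk': (111.7, 54.7),
    'Alexander-Arnold': (105.2, 35.3), 'Lovren': (111.8, 41.1),
}

def triangulation_edges(simplices):
    """Unique undirected edges (i, j) with i < j of a triangulation."""
    edges = set()
    for i, j, k in simplices:
        for u, v in ((i, j), (j, k), (i, k)):
            edges.add((int(min(u, v)), int(max(u, v))))
    return sorted(edges)

def distance_to_segment(p, a, b):
    """Shortest distance from point p to the segment from a to b."""
    ab = b - a
    t = np.clip(np.dot(p - a, ab) / np.dot(ab, ab), 0.0, 1.0)
    return float(np.linalg.norm(p - (a + t * ab)))

edges = triangulation_edges(Delaunay(real_xy).simplices)

# For every Liverpool player, the distance to the nearest Real Madrid link.
closest_link = {
    name: min(distance_to_segment(np.array(xy), real_xy[i], real_xy[j])
              for i, j in edges)
    for name, xy in liv_xy.items()
}
for name, dist in sorted(closest_link.items(), key=lambda item: item[1]):
    print(f'{name:>17s}: {dist:5.2f}')
```

  • How small a distance should count as standing "on" a link is a modelling choice that this sketch leaves open.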

  • The red nodes indicate the locations of Liverpool's players and the white nodes indicate those of Real Madrid's players. The black lines indicate the direct links between the players of one team at a particular moment, forming the Delaunay triangulation, also called the pass triangulation. In his book Soccermatics, Dr. Sumpter mentions that these lines tell us two useful things: first, they portray the availability of passes among the players of a team, and second, they indicate the "no man's lines" for the players of the opposition team, meaning that an opposition player standing on one of these linking lines is at a disadvantage. Beautiful implementation of computational geometry, isn't it?
  • Finally, we will compute the Voronoi diagram for the players at the same instant for which we have just computed the Delaunay triangulation.
  • The Voronoi diagram helps us visualize the zone of each player on the pitch at a particular moment of gameplay. Mathematically, the Voronoi diagram of a set X of points partitions the 2-D Euclidean plane into regions, one per point, where each region contains the locations that are closer to its point than to any other point in X.
  • Look at this Wikipedia article to study more on Voronoi diagrams.
  • The Delaunay triangulation and the Voronoi diagram are dual to each other, i.e., the circumcenters of the Delaunay triangles are the vertices of the Voronoi diagram for the set of points X. Look at the image of a Voronoi diagram (taken from here), which is the dual of the Delaunay triangulation that is shown above. voronoi.png
  • For computing the Voronoi diagram, remember to use the data_filtered dataset, because we need the locations of all the players on the pitch.
  • We import Voronoi to compute the diagram and voronoi_plot_2d to plot it on a pitch.
In [257]:
from scipy.spatial import Voronoi, voronoi_plot_2d
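  • The duality between the two structures can be verified numerically: every circumcenter of a Delaunay triangle should coincide with a vertex of the Voronoi diagram. The sketch below checks this on the five Real Madrid locations used earlier; the helper circumcentre and the variable names tri_demo and vor_demo are assumptions made for this illustration, and the check presumes points in general position (no four points on a common circle).

```python
import numpy as np
from scipy.spatial import Delaunay, Voronoi

# Real Madrid locations from the Delaunay example.
points = np.array([[111.7, 58.7], [100.9, 50.2], [108.9, 37.9],
                   [91.0, 30.3], [102.4, 40.6]])

def circumcentre(a, b, c):
    """Centre of the circle through the 2-D points a, b, c."""
    d = 2 * (a[0] * (b[1] - c[1]) + b[0] * (c[1] - a[1]) + c[0] * (a[1] - b[1]))
    ux = ((a @ a) * (b[1] - c[1]) + (b @ b) * (c[1] - a[1])
          + (c @ c) * (a[1] - b[1])) / d
    uy = ((a @ a) * (c[0] - b[0]) + (b @ b) * (a[0] - c[0])
          + (c @ c) * (b[0] - a[0])) / d
    return np.array([ux, uy])

tri_demo = Delaunay(points)
vor_demo = Voronoi(points)

centres = np.array([circumcentre(*points[s]) for s in tri_demo.simplices])

# Distance from each circumcentre to its closest Voronoi vertex.
gaps = [np.min(np.linalg.norm(vor_demo.vertices - c, axis=1)) for c in centres]
print(len(centres), len(vor_demo.vertices), max(gaps))
```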
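  • Next, we mirror the y coordinates across the pitch width (80 units in StatsBomb coordinates), extract the locations as points from data_filtered, and compute the Voronoi diagram. Note that this cell overwrites the y column of data_filtered, so it should be run only once: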
In [258]:
data_filtered['y'] = 80 - data_filtered['y']
points = data_filtered[['x', 'y']].values
vor = Voronoi(points)
  • Finally, we visualize the computed diagrams:
In [ ]:
pitch = Pitch(pitch_color='grass', stripe=True, line_color='white', view = 'half', figsize=(8,9),
              constrained_layout=True, tight_layout=False, goal_type='box')
fig, ax = pitch.draw()

plt.scatter(tracking_Real.x, 80 - tracking_Real.y, color='white', s = 1050, edgecolors='black', zorder=2)
plt.scatter(tracking_Liv.x, 80 -tracking_Liv.y, color='red', edgecolors='black', s = 1050)

pl = voronoi_plot_2d(vor, ax=ax, show_vertices=False, line_width = 8)

for index, row in loc_Real.iterrows():
    pitch.annotate(jerseys_Real[loc_Real['player_name'][row.name]], xy=(row.x, 80 - row.y), 
                   c ='black', va = 'center', ha = 'center', size = 15, ax = ax)

for index, row in loc_Liv.iterrows():
    pitch.annotate(jerseys_Liv[loc_Liv['player_name'][row.name]], xy=(row.x, 80 - row.y), 
                   c ='black', va = 'center', ha = 'center', size = 15, ax = ax)
    

Voronoi%20teams.png
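  • By definition, the Voronoi cell of a player contains the locations that are closer to that player than to anyone else, so the owner of any spot on the pitch can be found with a nearest-neighbour query, without building the diagram at all. A minimal sketch using scipy.spatial.cKDTree on the thirteen locations of data_filtered (in the original, unmirrored coordinates); the function name zone_owner is hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree

# Names and locations copied from the data_filtered table above.
names = ['Cristiano Ronaldo dos Santos Aveiro', 'Loris Karius',
         'Roberto Firmino Barbosa de Oliveira', 'Daniel Carvajal Ramos',
         'James Philip Milner', 'Karim Benzema', 'Georginio Wijnaldum',
         'Jordan Brian Henderson', 'Virgil van Dijk', 'Trent Alexander-Arnold',
         'Dejan Lovren', 'Toni Kroos', 'Francisco Román Alarcón Suárez']
locations = np.array([[111.7, 58.7], [118.1, 45.0], [100.8, 49.0],
                      [100.9, 50.2], [91.3, 28.4], [108.9, 37.9],
                      [105.7, 56.5], [108.0, 50.0], [111.7, 54.7],
                      [105.2, 35.3], [111.8, 41.1], [91.0, 30.3],
                      [102.4, 40.6]])

tree = cKDTree(locations)

def zone_owner(x, y):
    """Name of the player whose Voronoi cell contains the location (x, y)."""
    _, idx = tree.query([x, y])
    return names[idx]

# (108, 40) is the penalty spot in StatsBomb's 120 x 80 coordinates.
print(zone_owner(108.0, 40.0))
```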

  • So, the Voronoi diagram gives us the zone of every player on the pitch at a particular moment by breaking the pitch into distinct regions, one per player, indicating the field coverage of each player at that moment. This completes our section on the implementation of computational geometric concepts on football event and tracking data. This completes my presentation. 😌😌😌😌😌😌😌😌😌

References¶

  • Book Soccermatics by Dr. David Sumpter,
  • Friends of Tracking YouTube channel managed by Dr. Sumpter,
  • YouTube channel by McKay Johns,
  • Book Graph Theory and Complex Networks: An Introduction by Dr. Maarten van Steen, and
  • FCPython Blog

The End! Thank You! Wear Masks 😷😷, Get Vaccinated, and Stay Safe!¶